A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
Set up a pipeline (and provide missing parts) to evaluate the effectiveness of chain-of-thought reasoning (COT) in language models.
COT-eval is intended to be used in conjunction with Eleuther's lm-evaluation-harness (or similar packages, such as catwalk) to assess a model's ability to generate high-quality (i.e., effective) chain-of-thought reasoning traces.
The pipeline is as follows:
- Specify an eval configuration, including
  - `model`: the model to evaluate (e.g. mistralai/Mistral-7B-Instruct-v0.2)
  - `task`: the task to evaluate on (logiqa, lsat)
  - `chain`: the prompt chain used to generate the reasoning traces
  - `decoding`: the decoding strategy and parameters to use for reasoning (beam search, temperature, etc.)
- Perturb the `task`. (Because of potential training data contamination.)
- Run `cot-eval` to generate the reasoning traces with the `model` (and according to the configuration) for the perturbed `task`. (Push reasoning traces to HF hub.)
- Run `lm-evaluation-harness` to evaluate the `model` on the original `task`. This gives us `scores-1`.
- Run `lm-evaluation-harness` to evaluate the `model` on the perturbed `task`. This gives us `scores-2`.
- Run `lm-evaluation-harness` to evaluate the `model` on the perturbed `task` with added reasoning traces. This gives us `scores-3`.
- Conclude:
  - The difference between `scores-1` and `scores-2` is an indicator of training data contamination.
  - The difference between `scores-2` and `scores-3` is an indicator of COT effectiveness, i.e. the `model`'s reasoning skill.
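
The concluding step of the pipeline can be sketched in Python. This is a minimal illustration and not part of cot-eval: the `PipelineScores` container and both function names are hypothetical, the sign convention (original minus perturbed, with traces minus without traces) is an assumption, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineScores:
    """Hypothetical container for the three lm-evaluation-harness results."""

    original: float       # scores-1: original task
    perturbed: float      # scores-2: perturbed task, no reasoning traces
    perturbed_cot: float  # scores-3: perturbed task with reasoning traces


def contamination_indicator(s: PipelineScores) -> float:
    """scores-1 minus scores-2 (assumed sign convention)."""
    return s.original - s.perturbed


def cot_effectiveness(s: PipelineScores) -> float:
    """scores-3 minus scores-2 (assumed sign convention)."""
    return s.perturbed_cot - s.perturbed


# Made-up accuracies, for illustration only.
example = PipelineScores(original=0.50, perturbed=0.25, perturbed_cot=0.375)
print(f"contamination indicator: {contamination_indicator(example):+.3f}")
print(f"COT effectiveness:       {cot_effectiveness(example):+.3f}")
```

Under this convention, a large positive contamination indicator suggests the model benefits from having seen the original task, while a positive COT effectiveness value suggests the reasoning traces help on the perturbed task.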
git clone https://github.com/logikon-ai/cot-eval.git
cd cot-eval
pip install -e ".[cuda]"

Note: Use a personal HUGGINGFACEHUB_API_TOKEN. You have to be a member of the Open CoT Leaderboard for this to work.
See run.sh for an implementation of the pipeline.
cot-eval --help

Step 1. Clone cot-eval repo.
git clone https://github.com/logikon-ai/cot-eval.git
cd cot-eval

Step 2. Pull docker image
docker pull logikon/cot-eval:latest

Step 2a. (Alternatively:) Build docker image locally (allows you to adapt build args, e.g. VLLM_VERSION)
docker build --no-cache -t cot-eval --build-arg="VLLM_VERSION=0.3.0" . # change vllm version if necessary

Step 3. Set parameters and arguments
vim config.env # adapt config.env, set especially NEXT_MODEL_PATH="..." and HUGGINGFACEHUB_API_TOKEN="..."

Step 4. Run docker container
cat config.env # check
docker run -it --rm --gpus all --ipc=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --env-file config.env logikon/cot-eval:latest

# export TMPDIR=...
cd $TMPDIR
git clone https://github.com/logikon-ai/cot-eval.git
# edit config
vim cot-eval/config.env
export ENROOT_DATA_PATH=$TMPDIR/enroot-data
mkdir $ENROOT_DATA_PATH
export ENROOT_CONFIG_PATH=$TMPDIR/enroot-config
mkdir $ENROOT_CONFIG_PATH
touch $ENROOT_CONFIG_PATH/enroot.config
mkdir $ENROOT_CONFIG_PATH/environ.d
cp cot-eval/config.env $ENROOT_CONFIG_PATH/environ.d
enroot import docker://logikon/cot-eval
enroot create --name cot-eval logikon+cot-eval.sqsh
rm logikon+cot-eval.sqsh
enroot start --rw cot-eval

Alternatively:
ENROOT_SQUASH_OPTIONS='-comp lz4 -noD' enroot import docker://logikon/cot-eval
enroot start --rw logikon+cot-eval.sqsh

We're using the following Slurm script on booster:
#!/bin/bash -x
#SBATCH --account=<PROJECT_ID>
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=12:00:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:1
jutil env activate -p <PROJECT_ID>
# create tmp folder to bind with container
mkdir -p $SCRATCH/$SLURM_JOB_ID
apptainer run \
--nv \
--env HF_HOME=/mnt/cache/huggingface \
--env-file $PROJECT/config.env \
--no-mount home,cwd \
--bind $SCRATCH/$SLURM_JOB_ID:/mnt \
--containall \
$PROJECT/cot-eval.sif bash -c "mkdir /mnt/cache;mkdir /mnt/cache/huggingface;cd /workspace/cot-eval;bash run.sh"

git clone https://github.com/logikon-ai/cot-eval.git
cd cot-eval
docker build --no-cache -t cot-eval .
docker login --username logikon
docker tag cot-eval logikon/cot-eval:latest
docker push logikon/cot-eval:latest

cot-eval is distributed under the terms of the MIT license.