Reproducing Experiments

SplitLight uses Hydra-based configs and command-line overrides to run preprocessing, data splitting, and model training. All experiments in the paper can be reproduced by varying a small number of parameters at each stage.

Preprocessing
Splitting
Model Training

1. Preprocessing

Preprocessing constructs user interaction sequences, applies standard filtering (e.g. 5-core filtering) and prepares data for splitting.

Example:

python runs/preprocess.py \
  +dataset=Movielens-1m

Preprocessing parameters

The table below lists the available preprocessing parameters, which are set under the prep_params key.

Parameter	Default	Description
`item_min_count`	`5`	Minimum number of interactions per item (k-core filtering).
`seq_min_len`	`5`	Minimum user sequence length after filtering.
`core`	`True`	Enables iterative k-core filtering on users and items.
`encoding`	`True`	Encodes users and items into contiguous integer IDs.
`drop_consec_repeats`	`True`	Removes consecutive duplicate items in user sequences. Set to `False` to keep repeats.
`filter_by_relevance`	`False`	Optional relevance-based filtering (not used).
`save_to_disk`	`True`	Saves preprocessed data to disk.
`shuffle_collision`	`False`	Randomly shuffles interactions with identical timestamps.
`users_sample`	`null`	Optional subsampling of users (integer or fraction of all users).
`seed`	`42`	Random seed (used for timestamp shuffling and user sampling).

All preprocessing options are specified in the dedicated config.
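To make the filtering options more concrete, here is a minimal sketch of what iterative k-core filtering (`core`, `item_min_count`, `seq_min_len`) and `drop_consec_repeats` could look like. It is an illustration in plain Python, not the code used by runs/preprocess.py. The function names and the `(user, item, timestamp)` tuple layout are assumptions made for this example.

```python
from collections import Counter
from itertools import groupby


def k_core_filter(interactions, item_min_count=5, seq_min_len=5):
    """Illustrative iterative k-core filter (hypothetical helper).

    `interactions` is a list of (user, item, timestamp) tuples.
    Rare items and short user sequences are dropped repeatedly until
    a full pass removes nothing.
    """
    current = list(interactions)
    while True:
        item_counts = Counter(item for _, item, _ in current)
        kept = [row for row in current if item_counts[row[1]] >= item_min_count]
        user_counts = Counter(user for user, _, _ in kept)
        kept = [row for row in kept if user_counts[row[0]] >= seq_min_len]
        if len(kept) == len(current):
            return kept
        current = kept


def drop_consecutive_repeats(sequence):
    """Collapse runs of the same item into a single occurrence."""
    return [item for item, _ in groupby(sequence)]


toy = [
    ("u1", "a", 1), ("u1", "b", 2),
    ("u2", "a", 1), ("u2", "b", 2),
    ("u3", "b", 1), ("u3", "c", 2),
]
# Item "c" is too rare, which in turn makes user "u3" too short.
filtered = k_core_filter(toy, item_min_count=2, seq_min_len=2)
print(sorted({user for user, _, _ in filtered}))
print(drop_consecutive_repeats([1, 1, 2, 2, 1]))
```

The loop matters because removing a rare item can shorten a user sequence below the threshold, and removing that user can in turn make another item rare.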

Example:

# Shuffle equal timestamps
python runs/preprocess.py \
  dataset=Movielens-1m \
  prep_params.shuffle_collision=True \
  tag=shuffle

The optional tag argument is used to distinguish derived datasets on disk and must be reused at the splitting stage.
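As a rough picture of what shuffle_collision does, the sketch below randomly reorders only those interactions of a user that share a timestamp, using a fixed seed. It is an assumption-based illustration and not the project's implementation. The function name and the `(item, timestamp)` layout are hypothetical.

```python
import random
from itertools import groupby


def shuffle_timestamp_collisions(sequence, seed=42):
    """Shuffle interactions that share a timestamp (hypothetical helper).

    `sequence` is one user's list of (item, timestamp) pairs. The overall
    chronological order is preserved; only ties are permuted.
    """
    rng = random.Random(seed)
    ordered = sorted(sequence, key=lambda pair: pair[1])
    result = []
    for _, group in groupby(ordered, key=lambda pair: pair[1]):
        tied = list(group)
        rng.shuffle(tied)
        result.extend(tied)
    return result


user_sequence = [("a", 10), ("b", 10), ("c", 10), ("d", 20)]
shuffled = shuffle_timestamp_collisions(user_sequence, seed=42)
print(shuffled)
```

With a fixed seed the result is repeatable, which is why the seed parameter is listed alongside shuffle_collision in the table above.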

2. Splitting

Splitting is performed after preprocessing.

split_type: This is a top-level configuration key that specifies the overall splitting strategy. Available options are:
- global_timesplit – splits data based on a global time quantile, separating older interactions for training and newer ones for testing.
- leave-one-out – reserves the last interaction of each user as the test set, with the rest used for training.
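The two strategies can be contrasted on a toy dataset. The sketch below is only an illustration under simplifying assumptions: the function names are hypothetical, the quantile boundary is taken by simple rank without interpolation, and validation handling is omitted. It does not reproduce runs/split.py.

```python
def leave_one_out(sequences):
    """Hold out each user's last interaction as the test item."""
    train = {user: seq[:-1] for user, seq in sequences.items()}
    test = {user: seq[-1] for user, seq in sequences.items()}
    return train, test


def global_timesplit(sequences, quantile=0.9):
    """Split all users at one global timestamp boundary.

    The boundary is the timestamp at rank quantile * N (an assumption;
    a real implementation may interpolate between timestamps).
    """
    timestamps = sorted(ts for seq in sequences.values() for _, ts in seq)
    cut_index = min(int(quantile * len(timestamps)), len(timestamps) - 1)
    boundary = timestamps[cut_index]
    train = {u: [(i, t) for i, t in seq if t < boundary] for u, seq in sequences.items()}
    test = {u: [(i, t) for i, t in seq if t >= boundary] for u, seq in sequences.items()}
    return train, test


# user -> list of (item, timestamp), sorted by time
sequences = {
    "u1": [("a", 1), ("b", 2), ("c", 7)],
    "u2": [("a", 3), ("b", 4), ("d", 5), ("c", 6)],
    "u3": [("b", 8), ("e", 9), ("a", 10)],
}

loo_train, loo_test = leave_one_out(sequences)
gts_train, gts_test = global_timesplit(sequences, quantile=0.5)

print(loo_test["u1"])
print(gts_train["u3"], gts_test["u3"])
```

In this toy data, user u3 only appears after the global boundary, so under the time-based split it has no training history (a cold user), whereas leave-one-out always leaves every user with some history.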

All split-related parameters are defined under the split_params key.

Parameter	Default	Description
`quantile`	`0.9`	Global time quantile used to separate train/test (e.g. 0.8, 0.9, 0.95, 0.975).
`validation_quantile`	`${quantile}`	Quantile used to form validation after the test split.
`validation_type`	`by_user`	Validation strategy: `by_user`, `by_time`, `last_train_item`.
`validation_size`	`100`	Number of validation users (only for `by_user`).
`remove_cold_users`	`False`	Removes users unseen in training.
`remove_cold_items`	`True`	Removes items unseen in training. Set to `False` to keep cold items.
`target_type`	`first`	Prediction target for GTS: `first`, `last`, or `all`.
`tag`	`null`	Identifies preprocessing variant to split (e.g. `shuffle`).

The complete set of data splitting parameters is defined in the split config.

Example: Leave-One-Out (LOO)

python runs/split.py \
  dataset=Movielens-1m \
  split_type=leave-one-out

Example: Global Time Split (GTS)

Last-item prediction with time-based validation:

python runs/split.py \
  dataset=Movielens-1m \
  split_type=global_timesplit \
  split_params.validation_type=by_time \
  split_params.target_type=last

Splitting configuration used

We applied the following data splits for model training and evaluation:

- LOO with cold-item filtering: split_params.remove_cold_items=True (refer to leave-one-out-no_cold_items in /data/<DatasetName>/).
- Global Time Split (GTS) with cold-item removal (split_params.remove_cold_items=True) and a last-item target (split_params.target_type=last). The last item of each user's holdout subset becomes the target, and all interactions before it form the input. Cold items are removed from both inputs and targets, and targets without inputs (and vice versa) are dropped (refer to GTS-q09-val_by_time-target_last-no_cold_items in /data/<DatasetName>/).

For the cold-user and cold-item analysis we applied Global Time Split (GTS) with an all-items target: split_params.target_type=all, split_params.remove_cold_items=False. This setting preserves cold users and items and keeps targets that have no inputs (with the all-items target, these correspond to cold users) (refer to GTS-q09-val_by_time-target_all in /data/<DatasetName>/).
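The target types can be illustrated for a single user as follows. This is a sketch built on assumptions: `build_sample`, `warm_items`, and `drop_empty` are hypothetical names, and the choice that `first` and `all` use only the training history as input is inferred from the description above rather than taken from the code.

```python
def build_sample(train_items, holdout_items, target_type="last",
                 warm_items=None, drop_empty=True):
    """Build (inputs, targets) for one user under a global time split.

    warm_items: set of items seen in training; if given, cold items are
                removed from both inputs and targets (assumed behaviour).
    drop_empty: drop samples that end up without inputs or without targets.
    """
    if warm_items is not None:
        train_items = [i for i in train_items if i in warm_items]
        holdout_items = [i for i in holdout_items if i in warm_items]

    if target_type == "first":
        inputs, targets = train_items, holdout_items[:1]
    elif target_type == "last":
        # Earlier holdout items join the input sequence.
        inputs, targets = train_items + holdout_items[:-1], holdout_items[-1:]
    elif target_type == "all":
        inputs, targets = train_items, holdout_items
    else:
        raise ValueError(f"unknown target_type: {target_type}")

    if drop_empty and (not inputs or not targets):
        return None
    return inputs, targets


warm = {"a", "b", "c", "d"}
# "x" is a cold item and is filtered out before the target is chosen.
last_sample = build_sample(["a", "b"], ["c", "x", "d"], "last", warm_items=warm)
# A cold user keeps its targets when nothing is filtered or dropped.
cold_user_sample = build_sample([], ["a", "b"], "all", drop_empty=False)

print(last_sample)
print(cold_user_sample)
```

The second call mirrors the cold-user analysis setting: with no filtering, a user without training history still contributes targets, just with an empty input.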

Example: Using alternative preprocessing variants

To run splits on modified preprocessing outputs (e.g. shuffled timestamps), pass the tag that was assigned to the corresponding setup at the preprocessing stage:

python runs/split.py \
  dataset=Zvuk \
  split_type=leave-one-out \
  tag=shuffle

3. Model Training

Training is performed with runs/train_rs.py. Each run specifies several parameters:

Parameter	Default	Description
General / Experiment
`dataset`	`Beauty`	Dataset name
`model`	`SASRec`	Model name: `SASRec`, `RNN`, `BERT4Rec`
`split_name`	`leave-one-out`	Name of data split directory
`random_state`	`42`	Seed for reproducibility
`metrics_save_dir`	`results`	Directory to save metrics
`load_if_possible`	`True`	Load existing checkpoints if available

Full model and training parameters are defined in the training config.

Example: training SASRec with a single split and seed:

python runs/train_rs.py -m \
  dataset=Movielens-1m \
  model=SASRec \
  split_name=leave-one-out \
  random_state=0

Multiple splits and random seeds

In multirun mode (-m), Hydra sweeps over comma-separated values:

python runs/train_rs.py -m \
  dataset=Movielens-1m \
  model=SASRec \
  split_name=\
leave-one-out,\
leave-one-out-no_cold_items,\
GTS-q0.9-val_by_time-target_last \
  random_state=0,1,2,3,4

Each (split_name, random_state) combination is trained independently.
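The number of runs launched by the sweep above is the Cartesian product of the listed values. The short sketch below only enumerates the combinations to make the count explicit. It does not invoke Hydra, and the variable names are illustrative.

```python
from itertools import product

split_names = [
    "leave-one-out",
    "leave-one-out-no_cold_items",
    "GTS-q0.9-val_by_time-target_last",
]
random_states = [0, 1, 2, 3, 4]

# One independent training run per (split_name, random_state) pair.
runs = list(product(split_names, random_states))
print(len(runs))  # 15

for split_name, random_state in runs[:3]:
    print(f"split_name={split_name} random_state={random_state}")
```

Adding more models or datasets to the comma-separated lists multiplies the run count in the same way, so it is worth checking the grid size before launching a sweep.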
