Description
Describe the bug
The default method that computes embedding dimensions for categorical features from their cardinalities often produces sizes that lead to embedding collapse, where the downstream Two Tower model always recommends the same list. Put differently, the computed embedding dimensions often don't make sense: sometimes a dimension higher than the feature's cardinality is assigned, and sometimes a very high dimension is assigned relative to the cardinality, which results in a very low compression ratio.
Steps/Code to reproduce bug
```python
import nvtabular as nvt
from nvtabular import ops
import numpy as np
import pandas as pd


def generate_categorical_data(cardinalities, num_rows=1_000_000, seed=2025):
    np.random.seed(seed)
    data = {}
    for i, cardinality in enumerate(cardinalities, 1):
        feature_name = f'feature_{i}'
        data[feature_name] = np.random.randint(0, cardinality, size=num_rows)
    return pd.DataFrame(data)


cardinalities = [3, 5, 12, 20, 29, 50, 80, 230, 760, 1100, 4679, 8900]
df = generate_categorical_data(cardinalities)

for idx, col in enumerate(df.columns):
    print(cardinalities[idx], df[col].nunique())

cat_features = nvt.ColumnSelector(df.columns) >> ops.Categorify(freq_threshold=1, dtype='int32')
workflow = nvt.Workflow(cat_features)
ds = nvt.Dataset(df)
ds_tr = workflow.fit_transform(ds)

schema = ds_tr.schema
for col in schema.column_names:
    card = schema[col].properties['embedding_sizes']['cardinality']
    dim = schema[col].properties['embedding_sizes']['dimension']
    print(f'{col} cardinality: {card} dimension: {dim}')
```
Expected behavior
The snippet above will print out:
```
feature_1 cardinality: 6 dimension: 16
feature_2 cardinality: 8 dimension: 16
feature_3 cardinality: 15 dimension: 16
feature_4 cardinality: 23 dimension: 16
feature_5 cardinality: 32 dimension: 16
feature_6 cardinality: 53 dimension: 16
feature_7 cardinality: 83 dimension: 19
feature_8 cardinality: 233 dimension: 34
feature_9 cardinality: 763 dimension: 66
feature_10 cardinality: 1103 dimension: 81
feature_11 cardinality: 4682 dimension: 182
feature_12 cardinality: 8903 dimension: 261
```
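For what it's worth, the printed dimensions appear consistent with a fastai-style sizing heuristic, `min(max(16, round(1.6 * cardinality ** 0.56)), 512)`. This formula is an inference from the output above, not taken from the NVTabular source; a minimal sketch that reproduces the numbers:

```python
def default_embedding_dim(cardinality, minimum=16, maximum=512):
    # Inferred rule (assumption based on the printed output, e.g.
    # 233 -> 34, 1103 -> 81, 8903 -> 261): a power law with a floor of 16
    # and a ceiling of 512.
    return int(min(max(minimum, round(1.6 * cardinality ** 0.56)), maximum))

for card in [6, 8, 233, 763, 1103, 4682, 8903]:
    print(card, default_embedding_dim(card))
```

Note that the floor of 16 is what produces dimensions larger than the cardinality for the small features (e.g. cardinality 6 gets dimension 16).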
1. Does it make sense that the resulting dimension is higher than the feature's cardinality?
2. I would also argue that assigning an embedding dimension of e.g. 81 to a feature with a cardinality of 1103 could very easily cause an embedding collapse.
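To illustrate the point, a hypothetical alternative could cap the dimension at the feature's cardinality and grow more slowly with cardinality. The `safer_embedding_dim` sketch below uses the commonly cited fourth-root rule of thumb; it is an illustration of the argument, not an NVTabular API or a proposed fix:

```python
def safer_embedding_dim(cardinality, maximum=512):
    # Illustrative alternative (not an NVTabular API): fourth-root rule
    # of thumb, additionally capped at the cardinality itself so a
    # feature can never get more embedding dimensions than categories.
    dim = max(2, round(cardinality ** 0.25))
    return min(dim, cardinality, maximum)

for card in [6, 233, 1103, 8903]:
    print(card, safer_embedding_dim(card))
```

Under this rule the cardinality-1103 feature would get a dimension of 6 rather than 81, a far higher compression ratio.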
Environment details (please complete the following information):
- Environment location:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow-training