[BUG] - Default embedding dimensions for categorical features often leads to embedding collapse #1890

@hkristof03

Description

Describe the bug
The default method that computes embedding dimensions for categorical features from their cardinalities often leads to embedding collapse, where the downstream Two Tower model always recommends the same list. Put differently, the computed embedding dimensions often don't make sense: sometimes a feature is assigned a dimension higher than its own cardinality, and sometimes the dimension is high relative to the cardinality, which results in a very low compression ratio.
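
For context, the dimensions printed under "Expected behavior" below are consistent with a fastai-style sizing rule of the form min(max(16, round(1.6 * cardinality ** 0.56)), 512). Here is a minimal sketch of that rule; the function name and the exact bounds are reconstructed from the observed output, not quoted from the library source:

def emb_sz_rule(n_cat: int, minimum_size: int = 16, maximum_size: int = 512) -> int:
    # Dimension grows roughly as cardinality ** 0.56, clamped to [16, 512].
    return min(max(minimum_size, round(1.6 * n_cat ** 0.56)), maximum_size)

# Matches the printed output, e.g. emb_sz_rule(1103) == 81 and
# emb_sz_rule(8903) == 261. Cardinalities below roughly 60 are floored to 16,
# which is why the smallest features end up with a dimension larger than
# their cardinality.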

Steps/Code to reproduce bug

import nvtabular as nvt
from nvtabular import ops
import numpy as np
import pandas as pd


def generate_categorical_data(cardinalities, num_rows=1_000_000, seed=2025):
    """Generate one uniformly distributed integer column per cardinality."""
    np.random.seed(seed)

    data = {}
    for i, cardinality in enumerate(cardinalities, 1):
        feature_name = f'feature_{i}'
        data[feature_name] = np.random.randint(0, cardinality, size=num_rows)

    return pd.DataFrame(data)

cardinalities = [3, 5, 12, 20, 29, 50, 80, 230, 760, 1100, 4679, 8900]

df = generate_categorical_data(cardinalities)

# Sanity check: each generated column should realize its full cardinality.
for idx, col in enumerate(df.columns):
    print(cardinalities[idx], df[col].nunique())

cat_features = nvt.ColumnSelector(list(df.columns)) >> ops.Categorify(freq_threshold=1, dtype='int32')

workflow = nvt.Workflow(cat_features)

ds = nvt.Dataset(df)
ds_tr = workflow.fit_transform(ds)

schema = ds_tr.schema

# Categorify stores its embedding-size suggestion in the output schema.
for col in schema.column_names:
    card = schema[col].properties['embedding_sizes']['cardinality']
    dim = schema[col].properties['embedding_sizes']['dimension']
    print(f'{col} cardinality: {card} dimension: {dim}')

Expected behavior
Running the snippet above prints:

feature_1 cardinality: 6 dimension: 16
feature_2 cardinality: 8 dimension: 16
feature_3 cardinality: 15 dimension: 16
feature_4 cardinality: 23 dimension: 16
feature_5 cardinality: 32 dimension: 16
feature_6 cardinality: 53 dimension: 16
feature_7 cardinality: 83 dimension: 19
feature_8 cardinality: 233 dimension: 34
feature_9 cardinality: 763 dimension: 66
feature_10 cardinality: 1103 dimension: 81
feature_11 cardinality: 4682 dimension: 182
feature_12 cardinality: 8903 dimension: 261

1. Does it make sense that the resulting dimension is higher than the feature's cardinality?
2. I would also argue that assigning an embedding dimension of, e.g., 81 to a feature with a cardinality of 1103 can easily cause embedding collapse (see the sketch of a more conservative rule below).
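
For comparison, a common alternative is a fourth-root rule capped by the cardinality itself, so a feature can never receive more dimensions than it has distinct values. A sketch follows; the multiplier and hard cap are illustrative choices, not an NVTabular default:

def fourth_root_rule(cardinality: int, multiplier: float = 2.0, cap: int = 512) -> int:
    # dim ~ multiplier * cardinality ** 0.25, never above the cardinality
    # itself or the hard cap.
    return max(1, min(round(multiplier * cardinality ** 0.25), cardinality, cap))

# fourth_root_rule(1103) -> 12 (vs. 81 above); fourth_root_rule(6) -> 3 (vs. 16),
# keeping the dimension below the cardinality in every case.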

Environment details (please complete the following information):

  • Environment location:

https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow-training
