[BUG] ops.GroupBy after ops.Filter fails to group correctly, and produces unexpected NaNs #1886

@matib99

Describe the bug

Applying ops.Groupby(...) after ops.Filter(...) produces incorrect results: some output rows are filled with lists of NaNs, and rows are not grouped correctly. The problem seems to be with the row indexes.

A bug related to #1767

Steps/Code to reproduce bug
Sample code:

import pandas as pd
import nvtabular as nvt

# dummy data
_event_id = [0, 1, 2, 3]
_session = ["a", "a", "a", "b"]
_category = ["x", "x", "x", "y"]
_event_type = ["start", "start", "stop", "start"]
input_df = pd.DataFrame(
    {"event_id": _event_id, "session": _session, "category": _category, "event_type": _event_type}
)
print(input_df.head())

# graph
cat_feats = ["category"] >> nvt.ops.Categorify()

features = ["event_id", "session", "event_type"] + cat_feats

features = features >> nvt.ops.Filter(f=lambda df: df["event_type"] == "start")

groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session"],
    aggs={
        "event_id": "list",
        "category": ["list", "count"],
        "event_type": ["list"],
    },
)

processor = nvt.Workflow(groupby_features)
dataset = nvt.Dataset(input_df)

output_df = processor.fit_transform(dataset)
print(output_df.head())

input_df looks like this:

   event_id session category event_type
0         0       a        x      start
1         1       a        x      start
2         2       a        x       stop
3         3       b        y      start

And output_df (after filter and groupby):

  session    event_id_list    category_list        event_type_list  category_count
0       a  [0.0, 1.0, 3.0]  [3.0, 3.0, 4.0]  [start, start, start]               3
1       b            [nan]            [nan]                 [None]               0

Expected behavior
Expected output_df should look like this:

  session event_id_list category_list       event_type_list  category_count
0       a        [0, 1]        [3, 3]        [start, start]               2
1       b           [3]           [4]               [start]               1

The event with event_id == 3 should be assigned to session b, not a.
The event_id_list and category_list columns should contain lists of ints, not floats.
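
For reference, the expected table above can be reproduced with plain pandas. This is only a sketch of the intended semantics, not NVTabular code: the `category_codes` mapping is a hypothetical stand-in for ops.Categorify, with the codes 3 and 4 copied from the output shown above.

```python
import pandas as pd

input_df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "session": ["a", "a", "a", "b"],
        "category": ["x", "x", "x", "y"],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# Assumption: stand-in for ops.Categorify, using the codes seen in the output above.
category_codes = {"x": 3, "y": 4}
encoded = input_df.assign(category=input_df["category"].map(category_codes))

# Same filter as in the workflow; the surviving rows keep index labels 0, 1, 3.
filtered = encoded[encoded["event_type"] == "start"]

# Group the filtered rows by session and build the list/count aggregations.
expected_df = (
    filtered.groupby("session")
    .agg(
        event_id_list=("event_id", list),
        category_list=("category", list),
        event_type_list=("event_type", list),
        category_count=("category", "count"),
    )
    .reset_index()
)
print(expected_df)
```

Here event 3 lands in session b and the list columns keep their integer values, which is the behaviour I would expect from Filter followed by Groupby.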

Environment details (please complete the following information):

  • Environment location: docker container (from nvidia/cuda:11.8.0-devel-ubi8)
  • Method of NVTabular install: mamba
  • nvtabular version: 23.8.0

Additional context

Related issue #1767 was about a TypeError. In output_df above, the category_list column contains lists of floats (categories should be ints after ops.Categorify), so the values were apparently converted to floats in order to avoid the TypeError.

I believe that only the symptom of the bug was fixed there, not the cause. I think the TypeError was an indirect result of the bug described in this issue: since Groupby fills some rows with NaNs, there was a type conflict between the original values (ints) and the NaNs (floats). The real problem is that Groupby after Filter messes up the indexing and creates empty rows.
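
To illustrate the mechanism I suspect, here is a pandas-only sketch. It does not reproduce NVTabular's internals (I have not traced where the misalignment happens there); it only shows how combining a column that still carries the filtered index labels 0, 1, 3 with a column that was re-indexed positionally introduces NaNs and silently upcasts ints to floats.

```python
import pandas as pd

input_df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "session": ["a", "a", "a", "b"],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# After filtering, the index is no longer contiguous: labels 0, 1, 3.
filtered = input_df[input_df["event_type"] == "start"]

# One column gets positional labels (0, 1, 2), the other keeps 0, 1, 3.
keys = filtered["session"].reset_index(drop=True)
values = filtered["event_id"]

# pandas aligns on index labels, so the union 0..3 contains holes.
misaligned = pd.DataFrame({"session": keys, "event_id": values})
print(misaligned)
print(misaligned["event_id"].dtype)  # ints were upcast to hold the NaN

# Resetting the index on both sides keeps the rows aligned and the dtype intact.
aligned = pd.DataFrame({"session": keys, "event_id": values.reset_index(drop=True)})
print(aligned)
print(aligned["event_id"].dtype)
```

If something similar happens between Filter and Groupby, it would explain both the NaN-filled rows and the float lists reported here, as well as the TypeError from #1767.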

Labels: bug
