Description
Describe the bug
Applying ops.Groupby(...) after ops.Filter(...) produces incorrect output. Some rows are filled with lists of nans, and rows are not grouped correctly. The problem seems to be with indexes.
This bug is related to #1767.
Steps/Code to reproduce bug
Sample code:
import pandas as pd
import nvtabular as nvt
# dummy data
_event_id = [0, 1, 2, 3]
_session = ["a", "a", "a", "b"]
_category = ["x", "x", "x", "y"]
_event_type = ["start", "start", "stop", "start"]
input_df = pd.DataFrame(
    {"event_id": _event_id, "session": _session, "category": _category, "event_type": _event_type}
)
print(input_df.head())
# graph
cat_feats = ["category"] >> nvt.ops.Categorify()
features = ["event_id", "session", "event_type"] + cat_feats
features = features >> nvt.ops.Filter(f=lambda df: df["event_type"] == "start")
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session"],
    aggs={
        "event_id": "list",
        "category": ["list", "count"],
        "event_type": ["list"],
    },
)
processor = nvt.Workflow(groupby_features)
dataset = nvt.Dataset(input_df)
output_df = processor.fit_transform(dataset)
print(output_df.head())
input_df looks like this:
   event_id session category event_type
0         0       a        x      start
1         1       a        x      start
2         2       a        x       stop
3         3       b        y      start
And output_df (after filter and groupby):
   session    event_id_list    category_list        event_type_list  category_count
0        a  [0.0, 1.0, 3.0]  [3.0, 3.0, 4.0]  [start, start, start]               3
1        b            [nan]            [nan]                 [None]               0
Expected behavior
Expected output_df should look like this:
   session  event_id_list  category_list  event_type_list  category_count
0        a         [0, 1]         [3, 3]   [start, start]               2
1        b            [3]            [4]          [start]               1
The event with event_id == 3 should be assigned to session b, not a.
The dtype of the event_id_list and category_list columns should be lists of ints, not floats.
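For reference, the expected result can be reproduced with plain pandas, without NVTabular. This is only a sketch of the intended semantics: the mapping of categories x and y to the codes 3 and 4 is copied from the output shown above and stands in for ops.Categorify, and the named aggregations are chosen to match the column names that ops.Groupby produces.

```python
import pandas as pd

input_df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "session": ["a", "a", "a", "b"],
        "category": ["x", "x", "x", "y"],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# Stand-in for Categorify (assumption): codes 3 and 4 are taken from the
# output in this report, not computed by NVTabular
input_df["category"] = input_df["category"].map({"x": 3, "y": 4})

# Keep only the "start" events, as the Filter op is meant to do
filtered = input_df[input_df["event_type"] == "start"]

# Group by session and collect lists plus a count per group
expected = (
    filtered.groupby("session")
    .agg(
        event_id_list=("event_id", list),
        category_list=("category", list),
        event_type_list=("event_type", list),
        category_count=("category", "count"),
    )
    .reset_index()
)
print(expected)
```

Here event 3 stays in session b and the list columns keep integer values, which is the behaviour expected from the NVTabular workflow.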
Environment details (please complete the following information):
- Environment location: docker container (from nvidia/cuda:11.8.0-devel-ubi8)
- Method of NVTabular install: mamba
- nvtabular version: 23.8.0
Additional context
Related issue #1767 was about a TypeError. In the output_df you can see that the category_list column contains lists of floats (categories should be ints after ops.Categorify), so they were converted in order to avoid the TypeError.
I believe that only a symptom of the bug was fixed there, not the cause. I think the TypeError was an indirect result of the bug I describe in this issue: since Groupby causes some rows to be nans, there was a type conflict between the original values (ints) and the nans (floats). But the real problem is that Groupby after Filter messes up the indexing and creates some empty rows.
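The int-to-float effect described above can be shown in plain pandas. This is only an analogy under an assumption: it supposes that something downstream of the filter aligns rows by label against a fresh 0..n-1 index. Whether NVTabular does this internally is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# Boolean filtering keeps the original index labels, leaving a gap at 2
filtered = df[df["event_type"] == "start"]
print(filtered.index.tolist())

# Aligning by label against a fresh 0..n-1 index (an assumed failure mode,
# not confirmed NVTabular behaviour) drops label 3 and creates an empty
# row for label 2, which forces the integer column to float
misaligned = filtered["event_id"].reindex(range(len(filtered)))
print(misaligned)

# Resetting the index first keeps every row and the integer dtype
clean = filtered.reset_index(drop=True)
realigned = clean["event_id"].reindex(range(len(clean)))
print(realigned)
```

If the cause is similar, a fix would need to address the index left behind by Filter rather than cast the values to floats.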