[BUG] NVtabular.dataset.to_parquet(...) Improperly matched output dtypes detected in time, object and datetime64[ns] #1883

@Zachacy

Description

I tried to run the NVIDIA Merlin on Microsoft’s News Dataset (MIND) tutorial ...
While running Step 5: Feature Engineering - time-based features, I hit the following error:

data_train = nvt.Dataset(os.path.join(data_input_path, "train.parquet"), engine="parquet",part_size="256MB")
data_valid = nvt.Dataset(os.path.join(data_input_path, "valid.parquet"), engine="parquet",part_size="256MB")

dict_dtypes={}
for col in cat_features.columns:
    dict_dtypes[col] = np.int64

for col in cont_features.columns:
    dict_dtypes[col] = np.float32

for col in labels:
    dict_dtypes[col] = np.float32

%%time
proc.fit(data_train)

%%time

proc.transform(data_train).to_parquet(output_path=output_train_path,  # <- this line raises the error
                                      shuffle=nvt.io.Shuffle.PER_PARTITION,
                                      dtypes=dict_dtypes,
                                      out_files_per_proc=10,
                                      cats=cat_features.columns,
                                      conts=cont_features.columns,
                                      labels=labels)

/core/merlin/io/dataset.py:863: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
Failed to transform operator <nvtabular.ops.lambdaop.LambdaOp object at 0x7fa63bd86a00>
Traceback (most recent call last):
File "/nvtabular/nvtabular/workflow/workflow.py", line 485, in _transform_partition
raise TypeError(
TypeError: Improperly matched output dtypes detected in time, object and datetime64[ns]
distributed.worker - WARNING - Compute Failed
Function: _write_subgraph
args: (<merlin.io.dask.DaskSubgraph object at 0x7fa68c63f6d0>, ('part_0.parquet', 'part_1.parquet', 'part_2.parquet', 'part_3.parquet', 'part_4.parquet', 'part_5.parquet', 'part_6.parquet', 'part_7.parquet', 'part_8.parquet', 'part_9.parquet'), '/share/recommenders/MIND/processed_nvt/train', <Shuffle.PER_PARTITION: 0>, <fsspec.implementations.local.LocalFileSystem object at 0x7fa76da543a0>, ['time_hour', 'hist_cat_0', 'hist_subcat_0', 'hist_cat_1', 'hist_subcat_1', 'hist_cat_2', 'hist_subcat_2', 'hist_cat_3', 'hist_subcat_3', 'hist_cat_4', 'hist_subcat_4', 'hist_cat_5', 'hist_subcat_5', 'hist_cat_6', 'hist_subcat_6', 'hist_cat_7', 'hist_subcat_7', 'hist_cat_8', 'hist_subcat_8', 'hist_cat_9', 'hist_subcat_9', 'impr_cat', 'impr_subcat', 'impression_id', 'uid', 'time_minute', 'time_second', 'time_wd', 'time_day', 'time_day_week', 'time'], ['hist_count'], ['label'], 'parquet', 0, False, '')
kwargs: {}
Exception: "TypeError('Improperly matched output dtypes detected in time, object and datetime64[ns]')"
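
One possible reading of the message, offered as an assumption rather than a confirmed diagnosis: the workflow schema still lists the `time` column as `object` (a string), while the LambdaOp actually returned `datetime64[ns]` values, and the workflow compares the two before writing. The sketch below imitates that kind of comparison with dtypes written as plain strings. The helper names `find_dtype_mismatches` and `check_output_dtypes` are hypothetical and are not part of NVTabular, and the `hist_count` entry is included only for contrast.

```python
# Hypothetical sketch: dtypes are plain strings here, not real cuDF/NumPy dtypes,
# and none of these helpers exist in NVTabular.


def find_dtype_mismatches(declared, produced):
    """Return (column, declared_dtype, produced_dtype) for each column whose dtypes differ."""
    mismatches = []
    for column, declared_dtype in declared.items():
        produced_dtype = produced.get(column)
        if produced_dtype is not None and produced_dtype != declared_dtype:
            mismatches.append((column, declared_dtype, produced_dtype))
    return mismatches


def check_output_dtypes(declared, produced):
    """Raise a TypeError worded like the one in the traceback for the first mismatch."""
    for column, declared_dtype, produced_dtype in find_dtype_mismatches(declared, produced):
        raise TypeError(
            f"Improperly matched output dtypes detected in {column}, "
            f"{declared_dtype} and {produced_dtype}"
        )


# Assumed situation: the schema says "time" is a string column,
# but the transformed data holds datetimes.
declared_schema = {"time": "object", "hist_count": "float32"}
produced_dtypes = {"time": "datetime64[ns]", "hist_count": "float32"}

try:
    check_output_dtypes(declared_schema, produced_dtypes)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the fix would be to make the declared dtype of `time` agree with what the LambdaOp returns, or to drop the raw `time` column from the workflow output if only the derived `time_*` features are needed.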

My environment is based on [merlin-training:22.04].

Thanks!
