PermutationImportance error with XGBoost and NaNs - ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). (with a fix) #262

@ianozsvald

Description

Using the current versions of XGBoost and ELI5, if I add NaN values to X, show_weights works fine but PermutationImportance throws an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

To recreate:

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas
#2018-05-03 
#CPython 3.6.5
#IPython 6.3.1
#numpy 1.14.2
#sklearn 0.19.1
#eli5 0.8
#xgboost 0.71
#pandas 0.22.0
#compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
#system     : Linux
#release    : 4.9.91-040991-generic
#machine    : x86_64
#processor  : x86_64
#CPU cores  : 8
#interpreter: 64bit

# 10 items of data, pairs of (useless feature, predictive feature)
X_np = np.array([[np.nan, 1,], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2,], [0, 2,], [0, 2,], [0, 2,], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# With only the 10 items prepared above, XGBClassifier won't fit (RandomForestClassifier does),
# so the score is 0. If we concatenate to make "more data" (30 items in total), then
# XGBClassifier fits with 100% accuracy.
X = np.concatenate((X_np, X_np, X_np))
y = np.concatenate((y_np, y_np, y_np))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
#X y shapes: (15, 2) (15,) (15, 2) (15,)
#Classifier score (should be 1.0): 1.0

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The call to check_array uses sklearn's default constraints, which disallow NaN; XGBoost, however, is fine with NaN. My modification (monkey patched here for easy testing) is to call check_array(X, force_all_finite=False):

from sklearn.base import clone  # type: ignore
from sklearn.metrics.scorer import check_scoring  # type: ignore
from sklearn.utils import check_array, check_random_state  # type: ignore

def fit(self, X, y, groups=None, **fit_params):
    # type: (...) -> PermutationImportance
    """Compute ``feature_importances_`` attribute and optionally
    fit the base estimator.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training input samples.

    y : array-like, shape (n_samples,)
        The target values (integers that correspond to classes in
        classification, real numbers in regression).

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.

    **fit_params : Other estimator specific parameters

    Returns
    -------
    self : object
        Returns self.
    """
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)

    if self.cv != "prefit" and self.refit:
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)

    X = check_array(X, force_all_finite=False)  # allow NaN through; XGBoost treats NaN as missing
    #X = check_array(X)  # original line: rejects NaN

    if self.cv not in (None, "prefit"):
        si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    else:
        si = self._non_cv_scores_importances(X, y)
    scores, results = si
    self.scores_ = np.array(scores)
    self.results_ = results
    self.feature_importances_ = np.mean(results, axis=0)
    self.feature_importances_std_ = np.std(results, axis=0)
    return self

PermutationImportance.fit = fit
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
# no errors, reports perm results just fine
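
For reference, the whole difference is sklearn's finiteness check inside check_array; a standalone illustration (independent of the repro above):

import numpy as np
from sklearn.utils import check_array

X_nan = np.array([[np.nan, 1.0], [0.0, 2.0]])

# Default behaviour: the finiteness check rejects NaN with the error above.
try:
    check_array(X_nan)
except ValueError as err:
    print(err)  # Input contains NaN, infinity or a value too large ...

# Relaxed behaviour: NaN is passed through to the estimator,
# which is fine for XGBoost since it treats NaN as missing.
print(check_array(X_nan, force_all_finite=False))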

It might be wise to test whether the estimator is XGBoost rather than a sklearn one, and only then flip force_all_finite, so the stricter sklearn interpretation is preserved for sklearn estimators? A rough sketch of what that could look like is below.
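
The module-name heuristic and the _estimator_allows_nan helper here are only my illustration, not existing ELI5 API; an estimator tag or an explicit constructor flag might be cleaner:

def _estimator_allows_nan(estimator):
    # Heuristic: XGBoost (and LightGBM) estimators treat NaN as missing
    # values natively, so sklearn's finiteness check can be skipped for them.
    module = type(estimator).__module__
    return module.startswith(("xgboost", "lightgbm"))

# Inside PermutationImportance.fit, the validation would then become:
#     X = check_array(X, force_all_finite=not _estimator_allows_nan(self.estimator))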
