PermutationImportance error with XGBoost and NaNs - ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). (with a fix) #262

@ianozsvald

Description

Using the current versions of XGBoost and ELI5, if I add NaN values to X, show_weights works fine but PermutationImportance throws an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

To recreate:

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas
#2018-05-03 
#CPython 3.6.5
#IPython 6.3.1
#numpy 1.14.2
#sklearn 0.19.1
#eli5 0.8
#xgboost 0.71
#pandas 0.22.0
#compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
#system     : Linux
#release    : 4.9.91-040991-generic
#machine    : x86_64
#processor  : x86_64
#CPU cores  : 8
#interpreter: 64bit

# 10 items of data, pairs of (useless feature, predictive feature)
X_np = np.array([[np.nan, 1,], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2,], [0, 2,], [0, 2,], [0, 2,], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# With only the 10 items prepared above, XGBClassifier won't fit (RandomForestClassifier does),
# so the score is 0. If we concatenate to make "more data" (30 items in total), then
# XGBClassifier fits with 100% accuracy.
X = np.concatenate((X_np, X_np, X_np))
y = np.concatenate((y_np, y_np, y_np))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
#X y shapes: (15, 2) (15,) (15, 2) (15,)
#Classifier score (should be 1.0): 1.0

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The call to check_array uses sklearn's default constraints, which disallow NaN; XGBoost, however, is fine with NaN. My modification (monkey patched here for easy testing) is to call check_array(X, force_all_finite=False):

from sklearn.base import clone  # type: ignore
from sklearn.metrics.scorer import check_scoring  # type: ignore
from sklearn.utils import check_array, check_random_state  # type: ignore

def fit(self, X, y, groups=None, **fit_params):
    # type: (...) -> PermutationImportance
    """Compute ``feature_importances_`` attribute and optionally
    fit the base estimator.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training input samples.

    y : array-like, shape (n_samples,)
        The target values (integers that correspond to classes in
        classification, real numbers in regression).

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.

    **fit_params : Other estimator specific parameters

    Returns
    -------
    self : object
        Returns self.
    """
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)

    if self.cv != "prefit" and self.refit:
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)

    X = check_array(X, force_all_finite=False)  # allow NaN through; XGBoost treats NaN as missing
    #X = check_array(X)  # original line: rejects NaN

    if self.cv not in (None, "prefit"):
        si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    else:
        si = self._non_cv_scores_importances(X, y)
    scores, results = si
    self.scores_ = np.array(scores)
    self.results_ = results
    self.feature_importances_ = np.mean(results, axis=0)
    self.feature_importances_std_ = np.std(results, axis=0)
    return self

PermutationImportance.fit = fit
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
# no errors, reports perm results just fine
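
For reference, the whole difference is sklearn's finiteness check inside check_array; a standalone illustration (independent of the repro above):

import numpy as np
from sklearn.utils import check_array

X_nan = np.array([[np.nan, 1.0], [0.0, 2.0]])

# Default behaviour: the finiteness check rejects NaN with the error above.
try:
    check_array(X_nan)
except ValueError as err:
    print(err)  # Input contains NaN, infinity or a value too large ...

# Relaxed behaviour: NaN is passed through to the estimator,
# which is fine for XGBoost since it treats NaN as missing.
print(check_array(X_nan, force_all_finite=False))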

It might be wise to test whether the estimator is XGBoost rather than a sklearn one, and only then flip force_all_finite, so the stricter sklearn interpretation is preserved for sklearn estimators? A rough sketch of what that could look like is below.
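
The module-name heuristic and the _estimator_allows_nan helper here are only my illustration, not existing ELI5 API; an estimator tag or an explicit constructor flag might be cleaner:

def _estimator_allows_nan(estimator):
    # Heuristic: XGBoost (and LightGBM) estimators treat NaN as missing
    # values natively, so sklearn's finiteness check can be skipped for them.
    module = type(estimator).__module__
    return module.startswith(("xgboost", "lightgbm"))

# Inside PermutationImportance.fit, the validation would then become:
#     X = check_array(X, force_all_finite=not _estimator_allows_nan(self.estimator))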
