
Commit 9898a3b

Improve pipeline feature naming and categorical grouping support (#344)
* Improve pipeline feature naming and categorical grouping
  - add strip_pipeline_prefix/feature_name_fn support for pipeline-transformed feature names
  - add auto_detect_pipeline_cats inference for onehot-expanded pipeline columns
  - accept binary-like scaled onehot columns in parse_cats validation
  - preserve index in transformed pipeline dataframes and improve pipeline fallback warning
  - add helper unit tests and extend pipeline tests; update README/docs/release notes

  Refs #213

* Update TODO for #213 pipeline support improvements
1 parent 85660c0 commit 9898a3b

File tree

10 files changed (+798, -249 lines)

README.md

Lines changed: 16 additions & 0 deletions
@@ -168,6 +168,18 @@ db = ExplainerDashboard(explainer,
db.run(port=8050)
```

+If you are passing an sklearn/imblearn `Pipeline`, you can also clean up transformed
+feature names and let the explainer infer onehot groups automatically:
+
+```python
+explainer = ClassifierExplainer(
+    pipeline_model, X_test, y_test,
+    strip_pipeline_prefix=True,        # e.g. "num__Age" -> "Age"
+    feature_name_fn=None,              # optional custom rename function
+    auto_detect_pipeline_cats=True,    # infer cats from transformed pipeline output
+)
+```
+
For a regression model you can also pass the units of the target variable (e.g.
dollars):

@@ -184,6 +196,10 @@ explainer = RegressionExplainer(model, X_test, y_test,
ExplainerDashboard(explainer).run()
```

+For pipeline-based models with post-processing/scaling, grouped categorical
+features passed through `cats` are now accepted as long as encoded columns are
+binary-like (not strictly only `0/1`).
+
`y_test` is actually optional, although some parts of the dashboard like performance
metrics will obviously not be available: `ExplainerDashboard(ClassifierExplainer(model, X_test)).run()`.
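
To illustrate the relaxed `cats` validation above, here is a minimal, hypothetical sketch with synthetic data: the pipeline layout and column names are assumptions, but the point is that the onehot columns end up scaled (binary-like rather than strictly `0/1`) and can still be grouped:

```python
# Minimal sketch (synthetic data; exact grouping behaviour depends on the
# transformed column names your pipeline produces).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from explainerdashboard import ClassifierExplainer

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(1, 80, 200),
    "Fare": rng.uniform(5, 500, 200),
    "Sex": rng.choice(["male", "female"], 200),
    "Embarked": rng.choice(["C", "Q", "S"], 200),
})
y = pd.Series(rng.integers(0, 2, 200), name="Survival")

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Fare"]),
    ("cat", Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
        # scaling after onehot makes the encoded columns "binary-like"
        # rather than strictly 0/1
        ("scale", StandardScaler(with_mean=False)),
    ]), ["Sex", "Embarked"]),
])
pipeline_model = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, max_depth=5)),
]).fit(X, y)

explainer = ClassifierExplainer(
    pipeline_model, X, y,
    strip_pipeline_prefix=True,   # e.g. "cat__Sex_male" -> "Sex_male"
    cats=["Sex", "Embarked"],     # grouped despite the scaled (binary-like) encoding
)
```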

RELEASE_NOTES.md

Lines changed: 9 additions & 0 deletions
@@ -16,6 +16,15 @@
- Add CatBoost regression tests for classifier/regression `pdp_df(...)` with `X_row` containing missing categorical values.
- Add hub regression test for integrated hub yaml serialization to verify `pickle_type` is preserved and explainer artifacts are written.
- Add regression tests for issue #294 covering multiclass logodds consistency across prediction table, contributions, PDP highlight predictions, and XGBoost decision-path summaries.
+- Add pipeline tests for transformed feature-name cleanup (`strip_pipeline_prefix`, `feature_name_fn`) and pipeline categorical grouping autodetection.
+- Add explainer-method unit tests for binary-like onehot detection, transformed feature-name deduping, inferred pipeline cats, and pipeline extraction warning text.
+
+### Improvements
+- Add pipeline feature-name cleanup options: `strip_pipeline_prefix=True` and `feature_name_fn=...` for sklearn/imblearn pipeline transformed output columns.
+- Add optional `auto_detect_pipeline_cats=True` to infer onehot groups from transformed pipeline columns when `cats` is not provided.
+- Preserve input index in transformed pipeline dataframes produced during pipeline extraction.
+- Improve pipeline extraction warning guidance and include concrete checks (`get_feature_names_out`, transform compatibility on `X`/`X_background`).
+- Relax onehot grouping validation to also accept binary-like scaled onehot columns (not only strict `0/1`) when parsing `cats`.

### CI
- Update `explainerdashboard` GitHub Actions workflow to run a weekly scheduled full test suite (`pytest`) to detect dependency breakages earlier.
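
As a rough sketch of the `feature_name_fn` option (the exact call signature is an assumption here: a callable that receives each transformed column name and returns the display name), reusing `pipeline_model`, `X_test` and `y_test` from the README example above:

```python
from explainerdashboard import ClassifierExplainer

def clean_name(col: str) -> str:
    # hypothetical cleanup: "preprocess__cat__Embarked_S" -> "Embarked_S"
    return col.split("__")[-1]

# pipeline_model, X_test, y_test as in the README example above
explainer = ClassifierExplainer(
    pipeline_model, X_test, y_test,
    feature_name_fn=clean_name,
)
```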

TODO.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
- [M][Explainers][#198/#340] LightGBM string categorical handling across SHAP/plots.
- [S][Hub][#146/#342] hub.to_yaml integrate_dashboard_yamls honors pickle_type and dumps integrated explainer artifacts.
- [M][Explainers][#294] align/explain multiclass logodds between Contributions Plot and Prediction Box (+ PDP highlight and XGBoost decision path wording alignment).
+- [M][Explainers/Methods/Docs][#213] improve sklearn/imblearn pipeline support: feature-name cleanup (`strip_pipeline_prefix`, `feature_name_fn`), auto-detect onehot groups (`auto_detect_pipeline_cats`), accept binary-like scaled onehot columns in `cats`, preserve transformed index, add warnings/docs/tests.

**Now**
- [M][Explainers][#118] add LightGBM tree visualization support (dtreeviz).

docs/source/deployment.rst

Lines changed: 68 additions & 65 deletions
@@ -3,31 +3,31 @@ Deployment

When deploying your dashboard it is better not to use the built-in flask
development server but to use a more robust production server like ``gunicorn`` or ``waitress``.
`gunicorn <https://gunicorn.org/>`_ is probably a bit more fully featured and
faster, but it only works on unix/linux/osx, whereas
`waitress <https://docs.pylonsproject.org/projects/waitress/en/stable/>`_ also works
on Windows and has very minimal dependencies.

Install with either ``pip install gunicorn`` or ``pip install waitress``.

Storing explainer and running default dashboard with gunicorn
=============================================================

Before you start a dashboard with gunicorn you need to store both the explainer
instance and a configuration for the dashboard::

    from explainerdashboard import ClassifierExplainer, ExplainerDashboard

    explainer = ClassifierExplainer(model, X, y)
    db = ExplainerDashboard(explainer, title="Cool Title", shap_interaction=False)
    db.to_yaml("dashboard.yaml", explainerfile="explainer.joblib", dump_explainer=True)

Now you re-load your dashboard and expose a flask server as ``app`` in ``dashboard.py``::

    from explainerdashboard import ExplainerDashboard

    db = ExplainerDashboard.from_config("dashboard.yaml")
    app = db.flask_server()


.. highlight:: bash
@@ -36,13 +36,13 @@ If you named the file above ``dashboard.py``, you can now start the gunicorn ser

    $ gunicorn dashboard:app

If you want to run the server with, for example, three workers, binding to
port ``8050``, you launch gunicorn with::

    $ gunicorn -w 3 -b localhost:8050 dashboard:app

If you now point your browser to ``http://localhost:8050`` you should see your dashboard.
The next step is finding a nice url in your organization's domain, and forwarding it
to your dashboard server.

With waitress you would call::
@@ -70,19 +70,19 @@ You need to pass the Flask ``server`` instance and the ``url_base_pathname`` to
under ``db.app.index``::

    from flask import Flask

    app = Flask(__name__)

    [...]

    db = ExplainerDashboard(explainer, server=app, url_base_pathname="/dashboard/")

    @app.route('/dashboard')
    def return_dashboard():
        return db.app.index()


.. highlight:: bash

Now you can start the dashboard by::

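
As an aside, if gunicorn is not an option (for example on Windows), the same combined Flask app could be served with waitress instead; a minimal sketch:

```python
# Sketch: serve the combined Flask app defined above with waitress
# (e.g. on Windows, where gunicorn is not available).
from waitress import serve

if __name__ == "__main__":
    serve(app, host="0.0.0.0", port=8050)
```
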
@@ -95,12 +95,12 @@ Deploying to heroku
===================

You may want to deploy to `heroku <www.heroku.com>`_, which is normally
the simplest option for dash apps (see the
`dash instructions here <https://dash.plotly.com/deployment>`_). The demonstration
dashboard is also hosted on heroku at `titanicexplainer.herokuapp.com <http://titanicexplainer.herokuapp.com>`_.

In order to deploy to heroku there are a few things to keep in mind. First of
all you need to add ``explainerdashboard`` and ``gunicorn`` to
``requirements.txt`` (pinning is recommended to force a new build of your environment
whenever you upgrade versions)::

@@ -112,8 +112,8 @@ your explainer in ``runtime.txt``::

    python-3.8.6

(supported versions as of this writing are ``python-3.9.0``, ``python-3.8.6``,
``python-3.7.9`` and ``python-3.6.12``, but check the
`heroku documentation <https://devcenter.heroku.com/articles/python-support#supported-runtimes>`_
for the latest)

@@ -126,10 +126,10 @@ And you need to tell heroku how to start your server in ``Procfile``::
Graphviz buildpack
------------------

If you want to visualize individual trees inside your ``RandomForest`` or ``xgboost``
model using the ``dtreeviz`` package you will
need to make sure that ``graphviz`` is installed on your ``heroku`` dyno by
adding the following buildpack (as well as the ``python`` buildpack):
``https://github.com/weibeld/heroku-buildpack-graphviz.git``

(you can add buildpacks through the "settings" page of your heroku project)
@@ -150,11 +150,17 @@ E.g. **generate_dashboard.py**::
    X_train, y_train, X_test, y_test = titanic_survive()
    model = RandomForestClassifier(n_estimators=50, max_depth=5).fit(X_train, y_train)

    explainer = ClassifierExplainer(model, X_test, y_test,
                    cats=["Sex", 'Deck', 'Embarked'],
                    labels=['Not Survived', 'Survived'],
                    descriptions=feature_descriptions)

+    # For sklearn/imblearn pipeline models you can alternatively use:
+    # explainer = ClassifierExplainer(
+    #     pipeline_model, X_test, y_test,
+    #     strip_pipeline_prefix=True,
+    #     auto_detect_pipeline_cats=True)
+
    db = ExplainerDashboard(explainer)
    db.to_yaml("dashboard.yaml", explainerfile="explainer.joblib", dump_explainer=True)

@@ -193,45 +199,45 @@ Reducing memory usage

If you deploy the dashboard with a large dataset with a large number of rows (``n``)
and a large number of columns (``m``),
it can use up quite a bit of memory: the dataset itself, shap values,
shap interaction values and any other calculated properties are all kept in
memory in order to make the dashboard responsive. You can check the (approximate)
memory usage with ``explainer.memory_usage()``. In order to reduce the memory
footprint there are a number of things you can do:

1. Not including the shap interaction tab.
   Shap interaction values are of shape ``n*m*m``, so they can take a substantial amount
   of memory, especially if you have a significant number of columns ``m``.
2. Setting a lower precision.
   By default shap values are stored as ``'float64'``,
   but you can store them as ``'float32'`` instead and save half the space:
   ``ClassifierExplainer(model, X_test, y_test, precision='float32')``. You
   can also set a lower precision on your ``X_test`` dataset yourself of course.
3. Dropping non-positive class shap values.
   For multiclass classifiers, by default ``ClassifierExplainer`` calculates
   shap values for all classes. If you are only interested in a single class
   you can drop the other shap values with ``explainer.keep_shap_pos_label_only(pos_label)``.
4. Storing row data externally and loading it on the fly (see the sketch below).
   You can for example only store a subset of ``10.000`` rows in
   the ``explainer`` itself (enough to generate representative importance and dependence plots),
   and store the rest of your millions of rows of input data in an external file
   or database that gets loaded one row at a time with the following functions:

   - with ``explainer.set_X_row_func()`` you can set a function that takes
     an ``index`` as argument and returns a single row dataframe with model
     compatible input data for that index. This function can include a query
     to a database or a file read.
   - with ``explainer.set_y_func()`` you can set a function that takes
     an ``index`` as argument and returns the observed outcome ``y`` for
     that index.
   - with ``explainer.set_index_list_func()`` you can set a function
     that returns a list of available indexes that can be queried.

   If the number of indexes is too large to fit in a dropdown you can pass
   ``index_dropdown=False``, which turns the dropdowns into free text fields.
   Instead of an ``index_list_func`` you can also set an
   ``explainer.set_index_check_func(func)`` which should return a bool indicating whether
   the ``index`` exists or not.

Important: these functions can be called multiple times by multiple independent
components, so it is probably best to implement some kind of caching functionality.
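
A rough illustration of point 4: a sketch that serves rows from an external CSV file (the file name and column layout are made up, caching is added per the note above, and the exact expectations of these setter functions should be checked against the explainer API):

```python
# Rough sketch: serve individual rows from an external CSV instead of keeping
# all rows in the explainer. "all_data.csv" and its "Name"/"Survival" columns
# are hypothetical; in practice you would query a database instead.
from functools import lru_cache
import pandas as pd

@lru_cache(maxsize=1000)
def get_X_row(index):
    # returns a single-row dataframe of model-compatible input data
    df = pd.read_csv("all_data.csv", index_col="Name")
    return df.loc[[index]].drop(columns=["Survival"])

@lru_cache(maxsize=1000)
def get_y(index):
    # returns the observed outcome for that index
    df = pd.read_csv("all_data.csv", index_col="Name")
    return df.loc[index, "Survival"]

def get_index_list():
    # returns the list of available indexes
    return pd.read_csv("all_data.csv", usecols=["Name"])["Name"].tolist()

explainer.set_X_row_func(get_X_row)      # explainer as built above
explainer.set_y_func(get_y)
explainer.set_index_list_func(get_index_list)
```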
@@ -242,22 +248,22 @@ footprint there are a number of things you can do:
Setting logins and password
===========================

``ExplainerDashboard`` supports `dash basic auth functionality <https://dash.plotly.com/authentication>`_.
``ExplainerHub`` uses ``flask_simple_login`` for its user authentication.

You can simply add a list of logins to the ``ExplainerDashboard`` to force a login
and prevent random users from accessing the details of your model dashboard::

    ExplainerDashboard(explainer, logins=[['login1', 'password1'], ['login2', 'password2']]).run()

:ref:`ExplainerHub<ExplainerHub>` has somewhat more intricate user management
using ``FlaskLogin``, but the basic syntax is the same. See the
:ref:`ExplainerHub documentation<ExplainerHub>` for more details::

    hub = ExplainerHub([db1, db2], logins=[['login1', 'password1'], ['login2', 'password2']])

Make sure not to check these login/password pairs into version control though,
but store them somewhere safe! ``ExplainerHub`` stores passwords in a hashed
format by default.

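
A small, hypothetical sketch of keeping those credentials out of version control by reading them from environment variables (the variable names are made up; a secrets manager would work just as well):

```python
# Illustration only: ADMIN_USER / ADMIN_PASSWORD are hypothetical
# environment variable names.
import os
from explainerdashboard import ExplainerDashboard

logins = [[os.environ["ADMIN_USER"], os.environ["ADMIN_PASSWORD"]]]
ExplainerDashboard(explainer, logins=logins).run()   # explainer as built above
```
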
@@ -266,20 +272,20 @@ Automatically restart gunicorn server upon changes

We can use the ``explainerdashboard`` CLI tools to automatically rebuild our
explainer whenever there is a change to the underlying
model, dataset or explainer configuration. And we can use ``kill -HUP gunicorn.pid``
to force gunicorn to restart and reload whenever a new ``explainer.joblib``
is generated or the dashboard configuration ``dashboard.yaml`` changes. These two
processes together ensure that the dashboard automatically updates whenever there
are underlying changes.

First we store the explainer config in ``explainer.yaml`` and the dashboard
config in ``dashboard.yaml``. We also indicate which modelfiles and datafiles the
explainer depends on, and which columns in the datafile should be used as
a target and which as index::

    explainer = ClassifierExplainer(model, X, y, labels=['Not Survived', 'Survived'])
    explainer.dump("explainer.joblib")
    explainer.to_yaml("explainer.yaml",
            modelfile="model.pkl",
            datafile="data.csv",
            index_col="Name",
@@ -300,12 +306,12 @@ directly from the config file::

.. highlight:: bash

Now we would like to rebuild the ``explainer.joblib`` file whenever there is a
change to ``model.pkl``, ``data.csv`` or ``explainer.yaml`` by running
``explainerdashboard build``. And we restart the ``gunicorn`` server whenever
there is a change in ``explainer.joblib`` or ``dashboard.yaml`` by killing
the gunicorn server with ``kill -HUP pid``. To do that we need to install
the python package ``watchdog`` (``pip install watchdog[watchmedo]``). This
package can keep track of file changes and execute shell scripts when files change.

So we can start the gunicorn server and the two watchdog filechange trackers
@@ -321,17 +327,14 @@ from a shell script ``start_server.sh``::

    wait # wait till user hits ctrl-c to exit and kill all three processes

Now we can simply run ``chmod +x start_server.sh`` and ``./start_server.sh`` to
get our server up and running.

Whenever we now make a change to either one of the source files
(``model.pkl``, ``data.csv`` or ``explainer.yaml``), this produces a fresh
``explainer.joblib``. And whenever there is a change to either ``explainer.joblib``
or ``dashboard.yaml``, gunicorn restarts and rebuilds the dashboard.

So you can keep an explainerdashboard running without interruption and simply
drop an updated ``model.pkl`` or a fresh dataset ``data.csv`` into the directory and
the dashboard will automatically update.

0 commit comments
