SLEP005: Resampler API #15

Open
wants to merge 27 commits into
base: main
0c49bfb
SLEP005: Outlier Rejection API
glemaitre Mar 1, 2019
4ecc51b
Update slep005/proposal.rst
orausch Mar 2, 2019
c855ffe
Update slep
glemaitre Mar 2, 2019
c16ef7b
Update slep005/proposal.rst
adrinjalali Mar 5, 2019
e2f6a70
Update proposal based on discussion
orausch Jun 25, 2019
8f8ebb6
Reword semisupervised usecase
orausch Jun 25, 2019
ae03400
Add description of first few pipeline methods
orausch Jun 26, 2019
10c85ff
Add code examples
orausch Jun 26, 2019
c39d615
formatting
orausch Jun 26, 2019
2de0d48
Formatting and cleanup
orausch Jun 27, 2019
387b338
even more formatting
orausch Jun 27, 2019
5ecfead
more formatting
orausch Jun 27, 2019
e7faa6e
try these headings
orausch Jun 27, 2019
a4019ed
last one
orausch Jun 27, 2019
87a1d5d
Some changes based on the discussion in the thread (#1)
glemaitre Jul 3, 2019
5ddc6f9
minor rephrasing
glemaitre Jul 3, 2019
cde164b
address comments
glemaitre Jul 3, 2019
e87fd7e
Apply suggestions from code review
glemaitre Jul 3, 2019
ad4e94f
Some text about resampling pipelines and their issues
jnothman Jul 3, 2019
b989562
Some text about resampling pipelines and their issues (#2)
jnothman Jul 3, 2019
ee197cb
minor changes
glemaitre Jul 3, 2019
35c140d
iter
glemaitre Jul 3, 2019
bc45d6a
Some comments on fit params
jnothman Aug 26, 2019
e306795
Merge branch 'slep005' of https://github.com/glemaitre/enhancement_pr…
jnothman Aug 26, 2019
79123fb
Slep005: Some comments on fit params (#3)
glemaitre Sep 9, 2019
8538e82
Merge branch 'master' into slep005
glemaitre Sep 23, 2019
c64044d
Merge branch 'master' into slep005
jnothman Aug 17, 2020
1 change: 1 addition & 0 deletions index.rst
@@ -26,6 +26,7 @@
slep002/proposal
slep003/proposal
slep004/proposal
slep005/proposal

.. toctree::
:maxdepth: 1
98 changes: 98 additions & 0 deletions slep005/proposal.rst
@@ -0,0 +1,98 @@
.. _slep_005:

=====================
Outlier rejection API
=====================

:Author: Oliver Rausch (oliverrausch99@gmail.com), Guillaume Lemaitre (g.lemaitre58@gmail.com)
:Status: Draft
:Type: Standards Track

Review comment (Member): Do we have tracks?

:Created: 2019-03-01
:Resolution: <url>

Abstract
--------

We propose a new mixin ``OutlierRejectionMixin`` implementing a
``fit_resample(X, y)`` method. This method removes samples from
``X`` and ``y`` to produce an outlier-free dataset. ``fit_resample`` is
also handled by ``Pipeline``.

Detailed description
--------------------

Fitting a machine learning model on an outlier-free dataset can be
beneficial. Currently, the family of outlier detection algorithms
can detect outliers using ``estimator.fit_predict(X, y)``. However,
there is no mechanism to remove the detected outliers without a manual
step, and removing them inside a ``Pipeline`` is not possible at all.
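
The manual step mentioned above might look like the following minimal sketch. It assumes dense NumPy arrays and uses synthetic data invented for illustration; only ``IsolationForest.fit_predict`` and boolean indexing are relied on.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data, made up purely for this illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = rng.randint(0, 2, size=100)

# Step 1: detect outliers. fit_predict labels inliers 1 and outliers -1.
detector = IsolationForest(random_state=0)
labels = detector.fit_predict(X)
inliers = labels == 1

# Step 2 (manual): apply the same mask to X and y by hand.
X_free, y_free = X[inliers], y[inliers]
```

Because the mask has to be applied to ``X`` and ``y`` together, this step cannot be expressed as a transformer, which only returns a new ``X``.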

We propose the following changes:

* implement an ``OutlierRejectionMixin``;
* this mixin adds a method ``fit_resample(X, y)`` that removes outliers
  from ``X`` and ``y``;
* ``fit_resample`` should be handled in ``Pipeline``.

Implementation
--------------

Review comment (Member): I think it should be noted that the required changes are mostly (exclusively?) to Pipeline.
Is anything else affected? Some other meta-estimators?

And of course our API and contract changes. But the implementation is all contained within the pipeline and the resamplers?

Reply: Yes, changes are limited to new resamplers and the composition implementation (so either Pipeline or ResampledTrainer). I've also had to make several changes to existing estimator checks since many of them assumed that the estimator implements fit.

API changes are implemented in
https://github.com/scikit-learn/scikit-learn/pull/13269

Estimator implementation
........................

The new mixin is implemented as::

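    from sklearn.utils import safe_mask

    class OutlierRejectionMixin:
        _estimator_type = "outlier_rejector"

        def fit_resample(self, X, y):
            # fit_predict labels inliers as 1 and outliers as -1
            inliers = self.fit_predict(X) == 1
            # safe_mask returns a mask usable on its first argument
            return X[safe_mask(X, inliers)], y[safe_mask(y, inliers)]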

This will be used as follows for the outlier detection algorithms::

    class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin):
        ...

One can use the new algorithm with::

    from sklearn.ensemble import IsolationForest
    estimator = IsolationForest()
    X_free, y_free = estimator.fit_resample(X, y)
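
Since released versions of scikit-learn do not provide ``fit_resample`` on ``IsolationForest``, the usage above can be tried out with a small stand-in. The sketch below is an assumption-laden illustration, not the PR's implementation: ``OutlierRejectionSketch`` and ``RejectingIsolationForest`` are hypothetical names, the data is synthetic, and dense NumPy inputs are assumed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


class OutlierRejectionSketch:
    """Hypothetical stand-in for the proposed OutlierRejectionMixin."""

    def fit_resample(self, X, y):
        # Keep only the samples that fit_predict labels as inliers (1).
        inliers = self.fit_predict(X) == 1
        return X[inliers], y[inliers]


class RejectingIsolationForest(OutlierRejectionSketch, IsolationForest):
    """Illustrative subclass; not part of scikit-learn."""


# Synthetic data, made up purely for this illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = rng.randint(0, 2, size=200)

estimator = RejectingIsolationForest(random_state=0)
X_free, y_free = estimator.fit_resample(X, y)
```

The returned ``X_free`` and ``y_free`` always have the same number of rows, which is the property a downstream estimator relies on.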

Pipeline implementation
.......................

To handle outlier rejectors in ``Pipeline``, we enforce the following:

* an estimator cannot implement both ``fit_resample(X, y)`` and
  ``fit_transform(X)`` / ``transform(X)``;
* ``fit_predict(X)`` (i.e., clustering methods) should not be called if an
  outlier rejector is in the pipeline.
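
The two rules above could be checked along the following lines. This is only a sketch: ``validate_step``, ``validate_fit_predict`` and the toy classes are hypothetical names chosen for illustration, and the actual checks in the linked PR may differ.

```python
def is_resampler(step):
    """A step counts as a resampler if it exposes fit_resample."""
    return hasattr(step, "fit_resample")


def validate_step(step):
    """Rule 1: a step may resample or transform, but not both."""
    transforms = hasattr(step, "transform") or hasattr(step, "fit_transform")
    if is_resampler(step) and transforms:
        raise TypeError(
            f"{type(step).__name__} implements both fit_resample and "
            "fit_transform/transform"
        )


def validate_fit_predict(steps):
    """Rule 2: fit_predict is disallowed when any step is a resampler."""
    if any(is_resampler(step) for step in steps):
        raise ValueError("fit_predict is not supported with a resampler step")


# Toy steps used only to exercise the checks.
class Scaler:
    def fit_transform(self, X, y=None):
        return X

    def transform(self, X):
        return X


class Rejector:
    def fit_resample(self, X, y):
        return X, y


class Ambiguous(Scaler, Rejector):
    pass


validate_step(Scaler())      # passes: transformer only
validate_step(Rejector())    # passes: resampler only
```

One likely motivation for the second rule is that a resampler changes the number of samples, so the predictions returned by ``fit_predict`` would no longer line up with the rows of the input ``X``.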

Backward compatibility
----------------------

There are no backward incompatibilities with the current API.

Discussion
----------

* https://github.com/scikit-learn/scikit-learn/pull/13269

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

Review comment (Member): fix this?

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_