SLEP005: Resampler API #15
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ | |
slep002/proposal | ||
slep003/proposal | ||
slep004/proposal | ||
slep005/proposal | ||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
.. _slep_005: | ||
|
||
===================== | ||
Outlier rejection API | ||
===================== | ||
|
||
:Author: Oliver Rausch (oliverrausch99@gmail.com), Guillaume Lemaitre (g.lemaitre58@gmail.com) | ||
:Status: Draft | ||
:Type: Standards Track | ||
Review comment: Do we have tracks? |
||
:Created: 2019-03-01 | ||
:Resolution: <url> | ||
|
||
Abstract | ||
-------- | ||
|
||
We propose a new mixin, ``OutlierRejectionMixin``, implementing a | ||
``fit_resample(X, y)`` method. This method removes outlying samples from | ||
``X`` and ``y`` to obtain an outlier-free dataset. ``Pipeline`` also | ||
handles this method. | ||
|
||
Detailed description | ||
-------------------- | ||
|
||
Fitting a machine learning model on an outlier-free dataset can be | ||
beneficial. Currently, the family of outlier detection algorithms | ||
can detect outliers using ``estimator.fit_predict(X, y)``. However, | ||
there is no mechanism to remove the detected outliers without a manual | ||
step, and no way to do so at all inside a ``Pipeline``. | ||
|
||
We propose the following changes: | ||
|
||
* implement an ``OutlierRejectionMixin``; | ||
* this mixin adds a method ``fit_resample(X, y)`` that removes outliers | ||
  from ``X`` and ``y``; | ||
* ``fit_resample`` should be handled in ``Pipeline``. | ||
|
||
Implementation | ||
-------------- | ||
Review comment: I think it should be noted that the required changes are mostly (exclusively?) to And of course our API and contract changes. But the implementation is all contained within the pipeline and the resamplers?
Reply: Yes, changes are limited to new resamplers and the composition implementation (So either |
||
|
||
API changes are implemented in | ||
https://github.com/scikit-learn/scikit-learn/pull/13269 | ||
|
||
Estimator implementation | ||
........................ | ||
|
||
The new mixin is implemented as:: | ||
|
||
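    from sklearn.utils import safe_mask | ||
|
||
    class OutlierRejectionMixin: | ||
|
||
        _estimator_type = "outlier_rejector" | ||
|
||
        def fit_resample(self, X, y): | ||
            # fit_predict labels inliers as 1 and outliers as -1 | ||
            inliers = self.fit_predict(X) == 1 | ||
            # safe_mask returns a mask usable on its first argument, | ||
            # so the arrays still have to be indexed with it | ||
            return X[safe_mask(X, inliers)], y[safe_mask(y, inliers)] | ||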
|
||
This will be used as follows for the outlier detection algorithms:: | ||
|
||
    class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin): | ||
        ... | ||
|
||
The new method can then be used as follows:: | ||
|
||
    from sklearn.ensemble import IsolationForest | ||
    estimator = IsolationForest() | ||
    X_free, y_free = estimator.fit_resample(X, y) | ||
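
To make the contract of ``fit_resample`` concrete, here is a minimal, self-contained sketch that depends only on NumPy. ``ZScoreDetector`` is a hypothetical toy detector invented for this illustration and is not part of scikit-learn. The mixin here indexes with a plain boolean mask instead of ``safe_mask``, which assumes dense NumPy inputs.

```python
import numpy as np


class OutlierRejectionMixin:
    """Sketch of the proposed mixin, assuming dense NumPy inputs."""

    _estimator_type = "outlier_rejector"

    def fit_resample(self, X, y):
        # fit_predict is expected to return 1 for inliers and -1 for outliers
        inliers = self.fit_predict(X) == 1
        return X[inliers], y[inliers]


class ZScoreDetector(OutlierRejectionMixin):
    """Hypothetical toy detector: flags rows far from the column means."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold

    def fit_predict(self, X, y=None):
        X = np.asarray(X, dtype=float)
        std = X.std(axis=0)
        std[std == 0] = 1.0  # avoid division by zero on constant columns
        z = np.abs((X - X.mean(axis=0)) / std)
        # a row is an outlier if any of its features exceeds the threshold
        return np.where((z > self.threshold).any(axis=1), -1, 1)


X = np.array([[0.0], [0.1], [-0.1], [0.2], [-0.2],
              [0.0], [0.1], [-0.1], [0.05], [50.0]])
y = np.arange(10)

# The last row is far from the others, so it is dropped from both X and y.
X_free, y_free = ZScoreDetector().fit_resample(X, y)
print(X_free.shape, y_free)
```

The important property is that ``X`` and ``y`` are filtered with the same mask, so rows and targets stay aligned after rejection.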
|
||
Pipeline implementation | ||
....................... | ||
|
||
To handle outlier rejectors in ``Pipeline``, we enforce the following: | ||
|
||
* an estimator cannot implement both ``fit_resample(X, y)`` and | ||
  ``fit_transform(X)`` / ``transform(X)``. | ||
|
||
* ``fit_predict(X)`` (i.e., clustering methods) should not be called if an | ||
  outlier rejector is in the pipeline. | ||
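
The two rules above can be sketched with a simplified stand-in for ``Pipeline``. ``ToyPipeline``, ``RangeRejector`` and ``MeanRegressor`` are hypothetical names invented for this illustration, and this is not the implementation from the linked pull request. The sketch also assumes that rejectors act only during ``fit`` and are skipped at prediction time, which the text above does not state explicitly.

```python
import numpy as np


class ToyPipeline:
    """Hypothetical, simplified stand-in for Pipeline (not the real class)."""

    def __init__(self, steps):
        self.steps = steps
        for step in steps:
            is_resampler = hasattr(step, "fit_resample")
            is_transformer = (hasattr(step, "fit_transform")
                              or hasattr(step, "transform"))
            # Rule 1: a step may be a resampler or a transformer, not both.
            if is_resampler and is_transformer:
                raise TypeError("a step cannot resample and transform")

    def fit(self, X, y):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)  # may drop samples
            else:
                X = step.fit_transform(X, y)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        # Assumption: rejectors are skipped here, so no sample is dropped.
        for step in self.steps[:-1]:
            if not hasattr(step, "fit_resample"):
                X = step.transform(X)
        return self.steps[-1].predict(X)

    def fit_predict(self, X, y=None):
        # Rule 2: fit_predict is refused when a rejector is present.
        if any(hasattr(s, "fit_resample") for s in self.steps[:-1]):
            raise AttributeError("fit_predict is unavailable with a rejector")
        for step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        return self.steps[-1].fit_predict(X, y)


class RangeRejector:
    """Hypothetical rejector: keeps rows with |first feature| <= limit."""

    def __init__(self, limit):
        self.limit = limit

    def fit_resample(self, X, y):
        keep = np.abs(X[:, 0]) <= self.limit
        return X[keep], y[keep]


class MeanRegressor:
    """Hypothetical final estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


X = np.array([[1.0], [2.0], [3.0], [100.0]])
y = np.array([1.0, 2.0, 3.0, 1000.0])

pipe = ToyPipeline([RangeRejector(limit=10.0), MeanRegressor()]).fit(X, y)
prediction = pipe.predict(np.array([[100.0]]))
```

In this sketch the final estimator is fitted on the three retained rows only, while ``predict`` still returns one value per input row, including rows the rejector would have dropped during fitting.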
|
||
|
||
Backward compatibility | ||
---------------------- | ||
|
||
There are no backward incompatibilities with the current API. | ||
|
||
Discussion | ||
---------- | ||
|
||
* https://github.com/scikit-learn/scikit-learn/pull/13269 | ||
|
||
References and Footnotes | ||
------------------------ | ||
|
||
.. [1] Each SLEP must either be explicitly labeled as placed in the public | ||
Review comment: fix this? |
||
domain (see this SLEP as an example) or licensed under the `Open | ||
Publication License`_. | ||
|
||
.. _Open Publication License: https://www.opencontent.org/openpub/ | ||
|
||
|
||
Copyright | ||
--------- | ||
|
||
This document has been placed in the public domain. [1]_ |