SLEP005: Resampler API #15
@@ -26,6 +26,7 @@
   slep002/proposal
   slep003/proposal
   slep004/proposal
   slep005/proposal

.. toctree::
   :maxdepth: 1
@@ -0,0 +1,109 @@
.. _slep_005:

=============
Resampler API
=============

:Author: Oliver Rausch (oliverrausch99@gmail.com),
         Christos Aridas (char@upatras.gr),
         Guillaume Lemaitre (g.lemaitre58@gmail.com)
:Status: Draft
:Type: Standards Track
:Created: 2019-03-01
:Resolution: <url>

Abstract
--------

We propose the inclusion of a new type of estimator: the resampler. A
resampler changes the samples in ``X`` and ``y``. In short:

* resamplers reduce or augment the number of samples in ``X`` and
  ``y``;
* ``Pipeline`` should treat them as a separate type of estimator.

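The proposed contract can be sketched as follows. The class below is purely illustrative: its name, constructor, and undersampling strategy are assumptions for the sake of example, not part of the SLEP.

```python
import numpy as np


class RandomUnderSampler:
    """Hypothetical resampler following the proposed ``fit_resample`` API.

    It undersamples every class down to the size of the smallest class,
    so the number of output samples differs from the input.
    """

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        # Keep n_min randomly chosen samples per class.
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        return X[keep], y[keep]


X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = RandomUnderSampler().fit_resample(X, y)  # 2 samples per class
```

Note that, unlike ``transform``, ``fit_resample`` returns both ``X`` and ``y`` and need not preserve any correspondence between input and output rows.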
Motivation
----------

Sample reduction and augmentation are common steps in machine-learning
pipelines, but the current scikit-learn API does not offer support for
such use cases.

Two possible use cases are currently reported:

* sample rebalancing to correct bias toward classes with large cardinality;
* outlier rejection to fit on a clean dataset.

Implementation
--------------

To handle resamplers such as outlier rejectors in ``Pipeline``, we
enforce the following:

* an estimator cannot implement both ``fit_resample(X, y)`` and
  ``fit_transform(X)`` / ``transform(X)``. If both are implemented,
  ``Pipeline`` will not be able to know which of the two methods to
  call.
* resamplers are only applied during ``fit``. Otherwise, scoring will
  be harder. Specifically, the pipeline will act as follows:

===================== ================================
Method                Resamplers applied
===================== ================================
``fit``               Yes
``fit_transform``     Yes
``fit_resample``      Yes
``transform``         No
``predict``           No
``score``             No
``fit_predict``       not supported
===================== ================================

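As a rough illustration of the table above, a pipeline could dispatch on the presence of ``fit_resample`` and skip resamplers outside of ``fit``. Everything here (the dispatch rule and the toy estimators) is a sketch under assumed names, not the actual ``Pipeline`` implementation:

```python
import numpy as np


class HalfSampler:
    # Toy resampler: keeps every other sample (illustrative only).
    def fit_resample(self, X, y):
        return X[::2], y[::2]


class Centerer:
    # Toy transformer: centers features.
    def fit_transform(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        return X - self.mean_

    def transform(self, X):
        return X - self.mean_


class MeanPredictor:
    # Toy final estimator: predicts the mean of y seen at fit time.
    def fit(self, X, y):
        self.c_ = y.mean()

    def predict(self, X):
        return np.full(len(X), self.c_)


def pipeline_fit(steps, X, y):
    # Resamplers are applied at fit time and may change n_samples.
    for _, est in steps[:-1]:
        if hasattr(est, "fit_resample"):
            X, y = est.fit_resample(X, y)
        else:
            X = est.fit_transform(X, y)
    steps[-1][1].fit(X, y)


def pipeline_predict(steps, X):
    # At predict time resamplers are skipped, preserving the one-to-one
    # correspondence between input samples and predictions.
    for _, est in steps[:-1]:
        if not hasattr(est, "fit_resample"):
            X = est.transform(X)
    return steps[-1][1].predict(X)


steps = [("sub", HalfSampler()), ("center", Centerer()), ("reg", MeanPredictor())]
X = np.arange(10.0).reshape(10, 1)
y = np.arange(10.0)
pipeline_fit(steps, X, y)       # fits on the 5 resampled rows
pred = pipeline_predict(steps, X)  # still one prediction per input row
```

The key property shown is that ``predict`` returns as many outputs as it received inputs, even though ``fit`` saw a resampled dataset.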
* ``fit_predict(X)`` (i.e., on clustering methods) should not be called
  if an outlier rejector is in the pipeline: the output would be of a
  different size than ``X``, breaking metric computation.
* in a supervised scheme, a resampler will need to validate which type
  of target is passed. To our knowledge, supervised resamplers are used
  for binary and multiclass classification.

Alternative implementation
..........................

Alternatively, ``sample_weight`` could be used as a placeholder to
perform resampling. However, the current limitations are:

* ``sample_weight`` is not available for all estimators;
* ``sample_weight`` can only emulate sample reduction, not
  augmentation;
* ``sample_weight`` can be applied at both fit and predict time;
* ``sample_weight`` needs to be passed and modified within a
  ``Pipeline``.
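To make the trade-off concrete, class rebalancing can be emulated with inverse-frequency weights instead of resampling. The helper below is a sketch (its name and weighting scheme are illustrative, not taken from the SLEP):

```python
import numpy as np


def balancing_sample_weight(y):
    """Emulate class rebalancing via weights instead of resampling.

    Each class receives the same total weight, while n_samples is
    unchanged; this only works for estimators that accept sample_weight.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Inverse-frequency weight per class.
    per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([per_class[v] for v in y])


w = balancing_sample_weight([0] * 8 + [1] * 2)  # both classes sum to 5.0
```

This illustrates the limitation noted above: weighting rebalances influence but cannot augment the dataset, and a pipeline currently has no mechanism for a step to produce such weights for downstream estimators.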
Current implementation
......................

* Outlier rejection is implemented in:
  https://github.com/scikit-learn/scikit-learn/pull/13269

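The linked PR wraps scikit-learn outlier detectors; the self-contained sketch below substitutes a simple per-feature z-score rule so the shape of such a resampler is visible. The class name, parameters, and rejection rule are illustrative assumptions, not the PR's actual API:

```python
import numpy as np


class OutlierRejector:
    """Illustrative outlier-rejecting resampler.

    Drops every sample whose absolute per-feature z-score exceeds a
    threshold; the cleaned X and y are returned together.
    """

    def __init__(self, threshold=2.0):
        self.threshold = threshold

    def fit_resample(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Per-feature z-scores against the column mean and std.
        z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
        inlier = (z < self.threshold).all(axis=1)
        return X[inlier], y[inlier]


X = np.array([[0.0], [1.0], [0.0], [1.0], [0.0], [1.0], [100.0]])
y = np.array([0, 1, 0, 1, 0, 1, 1])
X_clean, y_clean = OutlierRejector().fit_resample(X, y)  # extreme row dropped
```

Because the number of returned samples is smaller than the input, this behaviour fits ``fit_resample`` but could not be expressed through ``transform``.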
Backward compatibility
----------------------

There are no backward incompatibilities with the current API.

Discussion
----------

* https://github.com/scikit-learn/scikit-learn/pull/13269

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

Copyright
---------

This document has been placed in the public domain. [1]_