SLEP016: parameter spaces on estimators #62
Open
jnothman wants to merge 14 commits into scikit-learn:main from jnothman:slep014-search-spaces
Changes from 2 commits
Commits
c619fdc  Partial draft of SLEP014: parameter spaces on estimators (jnothman)
79b67ff  typos (jnothman)
85e064b  Apply suggestions from code review (jnothman)
046f108  Add some discussion points from Andy's code review (jnothman)
239ff8a  Reintroduce edits after poor merge (jnothman)
8541950  Add docstring for set_grid (jnothman)
05be687  Attempt to complete Implementation section (jnothman)
f39e095  Correct SLEP number to 016 (jnothman)
e6c61c4  Complete draft of the SLEP (jnothman)
7f3df10  Merge remote-tracking branch 'upstream/main' into slep014-search-spaces
45adda4  Comment on aidss approach (jnothman)
748fa31  address review comments (jnothman)
0c946c7  address reviews (jnothman)
bb56429  Add clarifcation (jnothman)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
.. _slep_014: | ||
|
||
======================================= | ||
SLEP014: Parameter Spaces on Estimators | ||
======================================= | ||
|
||
:Author: Joel Nothman | ||
:Status: Draft | ||
:Type: Standards Track | ||
:Created: 2021-11-30 | ||
|
||
Abstract | ||
-------- | ||
|
||
This proposes to simplify the specification of parameter searches by allowing | ||
the user to store candidate values for each parameter on each estimator. | ||
The ``*SearchCV`` estimators would then have a setting to construct the | ||
parameter grid or distribution from a traversal of the supplied estimator. | ||
|
||
Detailed description | ||
-------------------- | ||
|
||
The ability to set and get parameters from deep within nested estimators using | ||
``get_params`` and ``set_params`` is powerful, but the specification of | ||
parameter spaces to search can be very unfriendly for users. | ||
In particular, the structure of the parameter grid specification needs to | ||
reflect the structure of the estimator, with every path explicitly notated with | ||
``__``-separated elements. | ||
|
||
For example, `one example <https://github.com/scikit-learn/scikit-learn/blob/d4d5f8c/examples/compose/plot_compare_reduction.py>`__ | ||
proposes searching over alternative preprocessing steps in a Pipeline and their | ||
parameters, as well as the parameters of the downstream classifier. | ||
|
||
:: | ||
|
||
    from sklearn.pipeline import Pipeline | ||
    from sklearn.svm import LinearSVC | ||
    from sklearn.decomposition import PCA, NMF | ||
    from sklearn.feature_selection import SelectKBest, chi2 | ||
|
||
    pipe = Pipeline( | ||
        [ | ||
            # the reduce_dim stage is populated by the param_grid | ||
            ("reduce_dim", "passthrough"), | ||
            ("classify", LinearSVC(dual=False, max_iter=10000)), | ||
        ] | ||
    ) | ||
|
||
    N_FEATURES_OPTIONS = [2, 4, 8] | ||
    C_OPTIONS = [1, 10, 100, 1000] | ||
    param_grid = [ | ||
        { | ||
            "reduce_dim": [PCA(iterated_power=7)], | ||
            "reduce_dim__n_components": N_FEATURES_OPTIONS, | ||
            "classify__C": C_OPTIONS, | ||
        }, | ||
        { | ||
            "reduce_dim": [SelectKBest(chi2)], | ||
            "reduce_dim__k": N_FEATURES_OPTIONS, | ||
            "classify__C": C_OPTIONS, | ||
        }, | ||
    ] | ||
|
||
Here we see that in order to specify the search space for the 'k' parameter of | ||
``SelectKBest``, the user needs to identify its fully qualified path from the | ||
root estimator (``pipe``) that will be passed to the grid search estimator, | ||
i.e. ``reduce_dim__k``. To construct this fully qualified parameter name, the | ||
user must know that the ``SelectKBest`` estimator resides in a ``Pipeline`` | ||
step named ``reduce_dim`` and that the Pipeline is not further nested in | ||
another estimator. Changing the name of ``reduce_dim`` would entail a change to | ||
5 lines in the above code snippet. | ||
|
||
We also see that the options for ``classify__C`` need to be specified twice. | ||
Were the candidate values for ``C`` something that belonged to the | ||
``LinearSVC`` estimator instance, rather than part of a grid specification, it | ||
would be possible to specify it only once. The list of two separate dicts of | ||
parameter spaces is altogether avoidable: the only reason the parameter space | ||
is duplicated is to handle the alternation of one step in the pipeline. For all | ||
other steps, it makes sense for the candidate parameter space to remain | ||
constant regardless of whether ``reduce_dim`` is a feature selector or a PCA. | ||
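
To make the duplication concrete, the following sketch expands the list-of-dicts grid from the example into individual candidates. It is a plain-Python illustration, not scikit-learn code: ``expand`` is a hypothetical helper, and strings stand in for the ``PCA`` and ``SelectKBest`` instances. Each dict is expanded independently, which is why ``classify__C`` has to be repeated in both.

```python
from itertools import product


def expand(param_grid):
    """Enumerate every candidate setting in a list of grid dicts.

    Hypothetical helper for illustration; each dict is expanded on its own
    and the resulting candidates are concatenated.
    """
    candidates = []
    for grid in param_grid:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": ["PCA"],  # string placeholder for a PCA instance
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": ["SelectKBest"],  # string placeholder for SelectKBest
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]

candidates = expand(param_grid)
# 3 * 4 settings per dict, two dicts
print(len(candidates))
```

If ``classify__C`` were dropped from the second dict, the ``SelectKBest`` candidates would simply not vary ``C`` at all, so the repetition cannot be avoided with this syntax.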
|
||
Here we propose to allow the user to specify candidates or distributions for | ||
local parameters on a specific estimator instance:: | ||
|
||
    svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) | ||
    pipe = Pipeline( | ||
        [ | ||
            # the reduce_dim stage is populated by set_grid below | ||
            ("reduce_dim", "passthrough"), | ||
            ("classify", svc), | ||
        ] | ||
    ).set_grid(reduce_dim=[ | ||
        PCA(iterated_power=7).set_grid(n_components=N_FEATURES_OPTIONS), | ||
        SelectKBest(chi2).set_grid(k=N_FEATURES_OPTIONS), | ||
    ]) | ||
|
||
With this use of ``set_grid``, ``GridSearchCV(pipe)`` would not need the | ||
parameter grid to be specified explicitly. Instead, a recursive descent through | ||
``pipe``'s parameters allows it to reconstruct exactly the grid used in the | ||
example above. | ||
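
The recursive descent can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the searchgrid or scikit-learn implementation: ``ToyEstimator`` and ``extract_grids`` are hypothetical names, and the toy treats pipeline step names as ordinary parameters rather than modelling ``Pipeline.steps``.

```python
from itertools import product


class ToyEstimator:
    """Hypothetical stand-in for an estimator; not scikit-learn's API."""

    def __init__(self, name, **params):
        self.name = name
        self.params = dict(params)
        self._grid = {}

    def set_grid(self, **grid):
        # Store candidate values for local parameters; return self to chain.
        self._grid.update(grid)
        return self

    def get_grid(self):
        return dict(self._grid)

    def __repr__(self):
        return self.name


def extract_grids(est):
    """Return a list of dicts mapping ``__``-separated paths to candidates."""
    grids = [{}]
    local = est.get_grid()
    for key in sorted(set(est.params) | set(local)):
        options = []  # alternative sub-grids contributed by this parameter
        if key in local:
            plain = [v for v in local[key] if not isinstance(v, ToyEstimator)]
            if plain:
                options.append({key: plain})
            for v in local[key]:
                if isinstance(v, ToyEstimator):
                    # Each candidate estimator brings its own nested grids.
                    for sub in extract_grids(v):
                        d = {key: [v]}
                        d.update({f"{key}__{k}": c for k, c in sub.items()})
                        options.append(d)
        elif isinstance(est.params[key], ToyEstimator):
            # Fixed sub-estimator: prefix whatever grids it carries.
            for sub in extract_grids(est.params[key]):
                options.append({f"{key}__{k}": c for k, c in sub.items()})
        if options:
            grids = [{**g, **o} for g, o in product(grids, options)]
    return grids


svc = ToyEstimator("LinearSVC").set_grid(C=[1, 10, 100, 1000])
pca = ToyEstimator("PCA").set_grid(n_components=[2, 4, 8])
kbest = ToyEstimator("SelectKBest").set_grid(k=[2, 4, 8])
pipe = ToyEstimator("Pipeline", reduce_dim="passthrough", classify=svc)
pipe.set_grid(reduce_dim=[pca, kbest])

grids = extract_grids(pipe)  # one dict per reduce_dim alternative
for grid in grids:
    print(grid)
```

In this toy, alternation over estimator-valued candidates is what splits the result into several dicts; parameters without candidates contribute nothing, so ``classify__C`` is declared once and still appears in every extracted dict.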
|
||
Such functionality therefore allows users to: | ||
|
||
* easily define a parameter space together with the estimator it relates to, | ||
  improving code cohesion. | ||
* establish a library of estimator configurations for reuse, reducing repeated | ||
  code, and reducing setup costs for auto-ML approaches. | ||
* avoid work modifying the parameter space specification when a composite | ||
  estimator's structure is changed, or a Pipeline step is renamed. | ||
* more comfortably specify search spaces that include the alternation of a | ||
  step in a Pipeline (or ColumnTransformer, etc.), creating a form of | ||
  conditional dependency in the search space. | ||
|
||
Implementation | ||
-------------- | ||
|
||
TODO | ||
|
||
setter, getter for grid. | ||
setter, getter for distribution. | ||
Overwriting behaviour | ||
|
||
Private attribute on estimator, dynamically allocated on request | ||
|
||
Grid Search update to handle param_grid='extract' using the algorithm | ||
and implementation from searchgrid [1]_. If an empty grid is extracted | ||
|
||
Randomized Search update to handle param_distributions='extract', using ``get_grid`` | ||
only to update the results of ``get_distribution``. | ||
|
||
Parameter spaces should be copied in clone, so that a user can overwrite only | ||
one parameter's space without redefining everything. | ||
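
A possible reading of this copying behaviour is sketched below. The names ``ToyEstimator`` and ``toy_clone`` are hypothetical, and the sketch assumes that ``set_grid`` overwrites candidates one parameter at a time, which this draft still lists as an open point under "Overwriting behaviour".

```python
import copy


class ToyEstimator:
    """Hypothetical estimator carrying a private parameter space."""

    def __init__(self, **params):
        self.params = dict(params)
        self._grid = {}

    def set_grid(self, **grid):
        # Assumption: only the named parameters are overwritten.
        self._grid.update(grid)
        return self

    def get_grid(self):
        return dict(self._grid)


def toy_clone(est):
    """Copy constructor parameters and the stored parameter space."""
    new = ToyEstimator(**copy.deepcopy(est.params))
    new._grid = copy.deepcopy(est._grid)
    return new


base = ToyEstimator(C=1.0).set_grid(C=[1, 10], tol=[1e-4, 1e-3])

# Overwrite only the space for C on the clone; tol is inherited from base.
variant = toy_clone(base).set_grid(C=[100, 1000])

print(variant.get_grid())
print(base.get_grid())
```

Deep-copying the stored space keeps the clone independent, so later edits to ``variant`` do not leak back into ``base``.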
|
||
Backward compatibility | ||
---------------------- | ||
|
||
No concerns | ||
|
||
Alternatives | ||
------------ | ||
|
||
TODO | ||
|
||
no methods, but storing on est | ||
GridFactory (:issue:`21784`) | ||
|
||
Alternative syntaxes | ||
|
||
one call per param? | ||
duplicate vs single method for grid vs distribution | ||
|
||
make_pipeline alternative or extension to avoid declaring 'passthrough' | ||
|
||
searchgrid [1]_, Neuraxle [2]_ | ||
|
||
Discussion | ||
---------- | ||
|
||
raised in :issue:`19045`. | ||
|
||
:issue:`9610`: our solution does not directly meet the need for conditional | ||
dependencies within a single estimator, e.g:: | ||
|
||
    param_grid = [ | ||
        { | ||
            "kernel": ["rbf"], | ||
            "gamma": [.001, .0001], | ||
            "C": [1, 10], | ||
        }, | ||
        { | ||
            "kernel": ["linear"], | ||
            "C": [1, 10], | ||
        } | ||
    ] | ||
|
||
searchgrid's implementation was mentioned in relation to | ||
https://github.com/scikit-learn/scikit-learn/issues/7707#issuecomment-392298478 | ||
|
||
Not handled: ``__`` paths still used in ``cv_results_`` | ||
|
||
This section may just be a bullet list including links to any discussions | ||
regarding the SLEP: | ||
|
||
- This includes links to mailing list threads or relevant GitHub issues. | ||
|
||
|
||
References and Footnotes | ||
------------------------ | ||
|
||
.. [1] Joel Nothman (2017). *SearchGrid*. Software Release. | ||
https://searchgrid.readthedocs.io/ | ||
|
||
.. [2] Guillaume Chevalier, Alexandre Brilliant and Eric Hamel (2019). | ||
*Neuraxle - A Python Framework for Neat Machine Learning Pipelines*. | ||
DOI:10.13140/RG.2.2.33135.59043. Software at https://www.neuraxle.org/ | ||
|
||
Copyright | ||
--------- | ||
|
||
This document has been placed in the public domain. |