SLEP016: parameter spaces on estimators #62
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,354 @@ | ||
.. _slep_016: | ||
|
||
======================================= | ||
SLEP016: Parameter Spaces on Estimators | ||
======================================= | ||
|
||
:Author: Joel Nothman | ||
:Status: Draft | ||
:Type: Standards Track | ||
:Created: 2021-11-30 | ||
|
||
Abstract | ||
-------- | ||
|
||
This proposes to simplify the specification of parameter searches by allowing | ||
the user to store candidate values for each parameter on the corresponding estimator. | ||
The ``*SearchCV`` estimators would then have a setting to construct the | ||
parameter grid or distribution from the supplied estimator. | ||
|
||
Detailed description | ||
-------------------- | ||
|
||
The ability to set and get parameters from deep within nested estimators using | ||
``get_params`` and ``set_params`` is powerful, but the specification of | ||
parameter spaces to search can be very unfriendly for users. | ||
In particular, the structure of the parameter grid specification needs to | ||
reflect the structure of the estimator, with every path explicitly specified by | ||
``__``-separated elements. | ||
|
||
For example, `one example <https://github.com/scikit-learn/scikit-learn/blob/d4d5f8c/examples/compose/plot_compare_reduction.py>`__ | ||
proposes searching over alternative preprocessing steps in a Pipeline and their | ||
parameters, as well as the parameters of the downstream classifier. | ||
|
||
:: | ||
|
||
from sklearn.pipeline import Pipeline | ||
from sklearn.svm import LinearSVC | ||
from sklearn.decomposition import PCA, NMF | ||
from sklearn.feature_selection import SelectKBest, chi2 | ||
|
||
pipe = Pipeline( | ||
    [ | ||
        # the reduce_dim stage is populated by the param_grid | ||
        ("reduce_dim", "passthrough"), | ||
        ("classify", LinearSVC(dual=False, max_iter=10000)), | ||
    ] | ||
) | ||
|
||
N_FEATURES_OPTIONS = [2, 4, 8] | ||
C_OPTIONS = [1, 10, 100, 1000] | ||
param_grid = [ | ||
    { | ||
        "reduce_dim": [PCA(iterated_power=7)], | ||
        "reduce_dim__n_components": N_FEATURES_OPTIONS, | ||
        "classify__C": C_OPTIONS, | ||
    }, | ||
    { | ||
        "reduce_dim": [SelectKBest(chi2)], | ||
        "reduce_dim__k": N_FEATURES_OPTIONS, | ||
        "classify__C": C_OPTIONS, | ||
    }, | ||
] | ||
|
||
Here we see that in order to specify the search space for the 'k' parameter of | ||
``SelectKBest``, the user needs to identify its fully qualified path from the | ||
root estimator (``pipe``) that will be passed to the grid search estimator, | ||
i.e. ``reduce_dim__k``. To construct this fully qualified parameter name, the | ||
user must know that the ``SelectKBest`` estimator resides in a ``Pipeline`` | ||
step named ``reduce_dim`` and that the Pipeline is not further nested in | ||
another estimator. Changing the step identifier ``reduce_dim`` would entail | ||
a change to 5 lines in the above code snippet. | ||
|
||
We also see that the options for ``classify__C`` need to be specified twice. | ||
It should be possible to specify it only once. The use of a list of two separate | ||
dicts of parameter spaces is similarly cumbersome: the only reason the | ||
parameter space is duplicated is to handle the alternation of one step in the | ||
pipeline; for all other steps, it makes sense for the candidate parameter | ||
space to remain constant regardless of whether ``reduce_dim`` is a feature | ||
selector or a PCA. | ||
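
With the current API, the duplication can be reduced somewhat by assembling the list of dicts programmatically. The following is only a sketch of that workaround: the names ``shared`` and ``alternatives`` are illustrative and do not appear in the original example, and strings stand in for the ``PCA`` and ``SelectKBest`` instances so that the snippet has no dependencies.

```python
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]

# Spaces that do not depend on which reduction step is chosen.
shared = {"classify__C": C_OPTIONS}

# One dict per alternative for the "reduce_dim" step. The strings are
# placeholders for estimator instances (an assumption of this sketch).
alternatives = [
    {
        "reduce_dim": ["<PCA instance>"],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
    },
    {
        "reduce_dim": ["<SelectKBest instance>"],
        "reduce_dim__k": N_FEATURES_OPTIONS,
    },
]

# Merge the shared space into every alternative.
param_grid = [{**alternative, **shared} for alternative in alternatives]
```

This removes the repeated ``classify__C`` entry, but every key must still be written as a fully qualified ``__`` path, so renaming a step still requires edits throughout.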
|
||
This SLEP proposes to add methods to estimators that allow the user | ||
to specify candidates or distributions for local parameters on a specific | ||
estimator instance:: | ||
|
||
svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) | ||
pipe = Pipeline( | ||
    [ | ||
        # the reduce_dim stage is populated by set_grid below | ||
        ("reduce_dim", "passthrough"), | ||
        ("classify", svc), | ||
    ] | ||
).set_grid(reduce_dim=[ | ||
    PCA(iterated_power=7).set_grid(n_components=N_FEATURES_OPTIONS), | ||
    SelectKBest(chi2).set_grid(k=N_FEATURES_OPTIONS), | ||
]) | ||
|
||
With this use of ``set_grid``, ``GridSearchCV(pipe)`` would not need the | ||
parameter grid to be specified explicitly. Instead, a recursive descent through | ||
``pipe``'s parameters allows it to reconstruct exactly the grid used in the | ||
example above. | ||
|
||
Such functionality therefore allows users to: | ||
|
||
* easily define a parameter space together with the estimator they relate to, | ||
improving code cohesion. | ||
* establish a library of estimator configurations for reuse, reducing repeated | ||
code, and reducing setup costs for auto-ML approaches. As such, this change | ||
helps to enable :issue:`5004`. | ||
* avoid work modifying the parameter space specification when a composite | ||
estimator's structure is changed, or a Pipeline step is renamed. | ||
* more comfortably specify search spaces that include the alternation of a | ||
step in a Pipeline (or ColumnTransformer, etc.), creating a form of | ||
conditional dependency in the search space. | ||
|
||
History | ||
------- | ||
|
||
:issue:`5082`, :issue:`7608` and :issue:`19045` have all raised associating | ||
parameter search spaces directly with an estimator instance, while this | ||
has been supported by third party packages [1]_, [2]_. :issue:`21784` proposed | ||
a ``GridFactory``, but feedback suggested that methods on each estimator | ||
were more usable than an external utility. | ||
|
||
This proposal pertains to the Scikit-learn Roadmap entry "Better support for | ||
manual and automatic pipeline building" dating back to 2018. | ||
|
||
Implementation | ||
-------------- | ||
|
||
Four public methods will be added to ``BaseEstimator``:: | ||
|
||
def set_grid(self, **grid: List[object]): | ||
    """Sets candidate values for parameters in a search | ||
|
||
    These candidates are used in grid search when a parameter grid is not | ||
    explicitly specified. They are also used in randomized search in the | ||
    case where set_distribution has not been used for the corresponding | ||
    parameter. | ||
|
||
    As with :meth:`set_params`, update semantics apply, such that | ||
    ``set_grid(param1=['a', 'b'], param2=[1, 2]).set_grid(param1=['a'])`` | ||
    will retain the candidates set for ``param2``. To reset the grid, | ||
    each parameter's candidates should be set to ``[]``. | ||
|
||
    Parameters | ||
    ---------- | ||
    grid : Dict[str, List[object]] | ||
        Keyword arguments define the values to be searched for each | ||
        specified parameter. | ||
|
||
        Keywords must be valid parameter names from :meth:`get_params`. | ||
|
||
    Returns | ||
    ------- | ||
    self : Estimator | ||
    """ | ||
    ... | ||
|
||
def get_grid(self): | ||
    """Retrieves current settings for parameters where search candidates are set | ||
|
||
    Note that this only reflects local parameter candidates, and a grid | ||
    including nested estimators can be constructed in combination with | ||
    `get_params`. | ||
|
||
    Returns | ||
    ------- | ||
    dict | ||
        A mapping from parameter name to a list of values. Each parameter | ||
        name should be a member of `self.get_params(deep=False).keys()`. | ||
    """ | ||
    ... | ||
|
||
def set_distribution(self, **distribution): | ||
    """Sets search distributions for parameters in a randomized search | ||
|
||
    These candidates are used in randomized search when a parameter | ||
    distribution is not explicitly specified. For parameters where | ||
    no distribution is defined and a grid is defined, those grid values | ||
    will also be used. | ||
|
||
    As with :meth:`set_params`, update semantics apply, such that | ||
    ``set_distribution(param1=['a', 'b'], param2=[1, 2]).set_distribution(param1=['a'])`` | ||
    will retain the candidates set for ``param2``. To reset the distributions, | ||
    each parameter's candidates should be set to ``[]``. | ||
|
||
    Parameters | ||
    ---------- | ||
    distribution : mapping from str to RV or list | ||
        Keyword arguments define the distribution to be searched for each | ||
        specified parameter. | ||
        Distributions may be specified either as an object with the method | ||
        ``rvs`` (see :mod:`scipy.stats`) or a list of discrete values with | ||
        uniform distribution. | ||
|
||
        Keywords must be valid parameter names from :meth:`get_params`. | ||
|
||
    Returns | ||
    ------- | ||
    self : Estimator | ||
    """ | ||
    ... | ||
|
||
def get_distribution(self): | ||
    """Retrieves current settings for parameters where a search distribution is set | ||
|
||
    Note that this only reflects local parameter candidates, and a joint distribution | ||
    including nested estimators can be constructed in combination with | ||
    `get_params`. | ||
|
||
    For parameters where ``set_distribution`` has not been used, but ``set_grid`` | ||
    has been, this will return the corresponding list of values specified in | ||
    ``set_grid``. | ||
|
||
    Returns | ||
    ------- | ||
    dict | ||
        A mapping from parameter name to a scipy-compatible distribution | ||
        (i.e. with ``rvs`` method) or list of discrete values. Each parameter | ||
        name should be a member of `self.get_params(deep=False).keys()`. | ||
    """ | ||
    ... | ||
|
||
The current distribution and grid values will be stored in a private | ||
attribute on the estimator, and ``get_grid`` may simply return this value, | ||
or an empty dict if undefined, while ``get_distribution`` will combine the | ||
stored parameter distributions with ``get_grid`` values. | ||
The attribute will be undefined by default upon construction of the estimator. | ||
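
The storage and fallback behaviour described above could look roughly like the following. This is a minimal sketch, not the proposed implementation: ``ParamSpaceMixin``, ``Toy``, ``_update_space`` and the private attribute names ``_param_grid`` and ``_param_distribution`` are all hypothetical, and ``Toy`` is a stand-in rather than a scikit-learn estimator.

```python
class ParamSpaceMixin:
    """Hypothetical mixin sketching the four proposed methods."""

    def set_grid(self, **grid):
        self._update_space("_param_grid", grid)
        return self

    def get_grid(self):
        # The private attribute is undefined until set_grid is first called.
        return dict(getattr(self, "_param_grid", {}))

    def set_distribution(self, **distribution):
        self._update_space("_param_distribution", distribution)
        return self

    def get_distribution(self):
        # Grid values are the fallback; explicit distributions take priority.
        combined = self.get_grid()
        combined.update(getattr(self, "_param_distribution", {}))
        return combined

    def _update_space(self, attr, updates):
        unknown = set(updates) - set(self.get_params(deep=False))
        if unknown:
            raise ValueError(f"Invalid parameter(s): {sorted(unknown)}")
        space = dict(getattr(self, attr, {}))
        for name, candidates in updates.items():
            if isinstance(candidates, list) and not candidates:
                space.pop(name, None)  # an empty list resets the parameter
            else:
                space[name] = candidates
        setattr(self, attr, space)


class Toy(ParamSpaceMixin):
    """Stand-in estimator with two local parameters."""

    def __init__(self, alpha=1.0, l1_ratio=0.5):
        self.alpha = alpha
        self.l1_ratio = l1_ratio

    def get_params(self, deep=True):
        return {"alpha": self.alpha, "l1_ratio": self.l1_ratio}


est = Toy().set_grid(alpha=[0.1, 1, 10], l1_ratio=[0.2, 0.8])
est.set_grid(alpha=[1])             # update semantics: l1_ratio is retained
est.set_distribution(alpha=[5, 6])  # overrides the grid for randomized search
```

Here ``est.get_grid()`` keeps ``alpha=[1]`` and the original ``l1_ratio`` candidates, while ``est.get_distribution()`` reports ``[5, 6]`` for ``alpha`` and falls back to the grid for ``l1_ratio``.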
|
||
Parameter spaces should be copied in :func:`sklearn.base.clone`, so that a user | ||
can overwrite only one parameter's space without redefining everything. | ||
To facilitate this (in the absence of a polymorphic implementation of clone), | ||
we might need to store the candidate grids and distributions in a known instance | ||
attribute, or use a combination of `get_grid`, `get_distribution`, `get_params` | ||
and `set_grid`, `set_distribution` etc. to perform `clone`. | ||
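
The second option, cloning through the public getters and setters, might be sketched as follows. Everything here is an assumption made for illustration: ``Toy`` is a stand-in estimator, ``clone_with_spaces`` is a hypothetical helper rather than ``sklearn.base.clone``, and ``copy.deepcopy`` stands in for whatever copying of candidate values a real implementation would choose.

```python
import copy


class Toy:
    """Stand-in estimator; ``_param_grid`` is a hypothetical attribute name."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        return {"alpha": self.alpha}

    def set_grid(self, **grid):
        self._param_grid = {**getattr(self, "_param_grid", {}), **grid}
        return self

    def get_grid(self):
        return dict(getattr(self, "_param_grid", {}))


def clone_with_spaces(estimator):
    """Rebuild an unfitted copy, then carry the local grid across."""
    params = copy.deepcopy(estimator.get_params(deep=False))
    new = type(estimator)(**params)
    grid = estimator.get_grid()
    if grid:
        new.set_grid(**copy.deepcopy(grid))
    return new


original = Toy(alpha=2).set_grid(alpha=[1, 2, 3])
duplicate = clone_with_spaces(original).set_grid(alpha=[10])
```

Overwriting the space on ``duplicate`` leaves the space on ``original`` untouched. One caveat of this route: if distributions were copied via ``get_distribution`` and ``set_distribution``, grid fallbacks would be written into the clone's distribution, so a known attribute may be the simpler option.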
|
||
Search estimators in `sklearn.model_selection` will be updated such that the | ||
currently required `param_grid` and `param_distributions` parameters will now default | ||
to 'extract'. The 'extract' value instructs the search estimator to construct | ||
a complete search space from the provided estimator's `get_grid` (respectively, | ||
`get_distribution`) return value together with `get_params`. | ||
It recursively calls `get_grid` (respectively, `get_distribution`) on every | ||
parametrized object (i.e. one with `get_params`) that has this method and is a | ||
descendant of the given estimator, including: | ||
|
||
* values in ``estimator.get_params(deep=True)`` | ||
* elements of list values in ``x.get_grid()`` or ``x.get_distribution()`` | ||
as appropriate (disregarding rvs) for any `x` descendant of the estimator. | ||
|
||
See the implementation of ``build_param_grid`` in Searchgrid [1]_, which applies | ||
to the grid search case. This algorithm enables the specification of searches | ||
over components in a pipeline as well as their parameters. | ||
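
For the grid search case, the recursive extraction could be sketched as below. This is an illustration under stated assumptions, not Searchgrid's code: ``Node`` is a stand-in for an estimator, and this ``build_param_grid`` is a simplified reimplementation written for this example that handles only lists of candidates.

```python
from itertools import product


class Node:
    """Stand-in for an estimator; not a scikit-learn class."""

    def __init__(self, label, **params):
        self.label = label
        self._params = params
        self._grid = {}

    def get_params(self, deep=True):
        return dict(self._params)

    def set_grid(self, **grid):
        self._grid.update(grid)
        return self

    def get_grid(self):
        return dict(self._grid)

    def __repr__(self):
        return self.label


def _is_parametrized(obj):
    return hasattr(obj, "get_params") and hasattr(obj, "get_grid")


def _prefixed(name, sub_grid):
    return {f"{name}__{key}": values for key, values in sub_grid.items()}


def build_param_grid(estimator):
    """Return a list of dicts in the format accepted by ``param_grid``."""
    local_grid = estimator.get_grid()
    per_param = []  # for each parameter, a list of alternative sub-grids
    for name, value in estimator.get_params(deep=True).items():
        if "__" in name:
            continue  # nested parameters are reached by recursion instead
        if name in local_grid:
            alternatives, plain = [], []
            for candidate in local_grid[name]:
                if _is_parametrized(candidate):
                    # A candidate estimator brings its own sub-grid(s).
                    for sub in build_param_grid(candidate):
                        alternatives.append(
                            {name: [candidate], **_prefixed(name, sub)}
                        )
                else:
                    plain.append(candidate)
            if plain:
                alternatives.append({name: plain})
            per_param.append(alternatives)
        elif _is_parametrized(value):
            # A fixed nested estimator contributes its grid under a prefix.
            per_param.append(
                [_prefixed(name, sub) for sub in build_param_grid(value)]
            )
    grids = []
    for combination in product(*per_param):
        merged = {}
        for part in combination:
            merged.update(part)
        grids.append(merged)
    return grids


svc = Node("svc", C=1).set_grid(C=[1, 10])
pca = Node("pca", n_components=None).set_grid(n_components=[2, 4])
kbest = Node("kbest", k=10).set_grid(k=[2, 4])
pipe = Node("pipe", reduce_dim="passthrough", classify=svc)
pipe.set_grid(reduce_dim=[pca, kbest])

grid = build_param_grid(pipe)
```

In this sketch ``grid`` is a list of two dicts, one per alternative for ``reduce_dim``, each also containing ``classify__C``. An estimator with no spaces anywhere yields ``[{}]``.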
|
||
Where the search estimator performing the 'extract' algorithm extracts an empty | ||
grid or distribution altogether for the given estimator, it should raise a | ||
``ValueError``, indicative of likely user error. | ||
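
A check of this kind might look like the following sketch. The function name ``check_extracted_grid`` and the error message are hypothetical, and the input is assumed to be a list of dicts as accepted by ``param_grid``.

```python
def check_extracted_grid(param_grid):
    """Raise if extraction found no search space at all."""
    # An empty list, or a list containing only empty dicts, means that no
    # estimator in the hierarchy had a grid set.
    if not any(param_grid):
        raise ValueError(
            "No parameter space was found on the estimator or its "
            "descendants; was set_grid called?"
        )
    return param_grid


checked = check_extracted_grid([{"classify__C": [1, 10]}])
```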
|
||
Backward compatibility | ||
---------------------- | ||
|
||
No concerns | ||
|
||
Alternatives | ||
------------ | ||
|
||
The fundamental change here is to associate parameter search configuration with each atomic estimator object. | ||
|
||
Alternative APIs to do so include: | ||
|
||
* Provide a function ``set_grid`` as Searchgrid [1]_ does, which takes an | ||
estimator instance and a parameter space, and sets a private | ||
attribute on the estimator object. This avoids cluttering the estimator's | ||
method namespace. | ||
* Provide a `GridFactory` (see :issue:`21784`) which allows the user to | ||
construct a mapping from atomic estimator instances to their search spaces. | ||
Aside from not cluttering the estimator's namespace, this may have | ||
theoretical benefit in allowing the user to construct multiple search spaces | ||
for the same composite estimator. There are no known use cases for this | ||
benefit. | ||
|
||
Another questionable design is the separation of ``set_grid`` and ``set_distribution``. | ||
These could be combined into a single method (``set_search_space``?), such that | ||
:class:`~sklearn.model_selection.GridSearchCV` rejects a call to `fit` where `rvs` | ||
appear. This would make it harder to predefine search spaces that could be used | ||
for either exhaustive or randomised searches, which may be a use case in Auto-ML. | ||
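
Under the combined-method alternative, the rejection at ``fit`` time could be sketched as below. The names ``validate_for_grid_search`` and ``ToyLogUniform`` are hypothetical; ``ToyLogUniform`` is a stand-in exposing only an ``rvs`` method, not a ``scipy.stats`` distribution.

```python
import math
import random


class ToyLogUniform:
    """Stand-in for a distribution object; only ``rvs`` is provided."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, random_state=None):
        rng = random.Random(random_state)
        return math.exp(rng.uniform(math.log(self.low), math.log(self.high)))


def validate_for_grid_search(space):
    """Reject spaces that contain anything with an ``rvs`` method."""
    sampled = sorted(name for name, cand in space.items() if hasattr(cand, "rvs"))
    if sampled:
        raise ValueError(
            f"Grid search cannot enumerate distributions for: {sampled}"
        )
    return space


mixed_space = {"C": ToyLogUniform(1e-2, 1e2), "penalty": ["l1", "l2"]}
discrete_space = {"C": [1, 10], "penalty": ["l1", "l2"]}
validate_for_grid_search(discrete_space)  # accepted
```

Passing ``mixed_space`` to the same function raises ``ValueError``, which illustrates why a single predefined space could not serve both search strategies under this alternative.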
in those cases, they'd need to have separate calls to
What I'm trying to communicate here is that if we had a single
Not sure if I should change anything here.
Inserted that clarification into the text. |
||
|
||
The design of using a keyword argument for each parameter in ``set_grid`` | ||
encourages succinctness but reduces extensibility. | ||
For example, we could design the API to require a single call per parameter:: | ||
|
||
est.set_grid('alpha', [.1, 1, 10]).set_grid('l1_ratio', [.2, .4, .6, .8]) | ||
|
||
This design would allow further parameters to be added to `set_grid` to enrich | ||
the use of this data, including whether or not it is intended for randomised search. | ||
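
As an illustration of that extensibility, a one-parameter-per-call ``set_grid`` could accept an extra keyword. The sketch below is purely hypothetical: the ``randomized`` flag, the ``_spaces`` attribute and the ``Toy`` class are invented for this example and are not part of the proposal.

```python
class Toy:
    """Stand-in estimator; the ``randomized`` flag is a hypothetical extension."""

    def __init__(self, alpha=1.0, l1_ratio=0.5):
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self._spaces = {}

    def get_params(self, deep=True):
        return {"alpha": self.alpha, "l1_ratio": self.l1_ratio}

    def set_grid(self, name, values, *, randomized=False):
        if name not in self.get_params(deep=False):
            raise ValueError(f"Invalid parameter: {name!r}")
        # Store the candidates together with the per-parameter metadata.
        self._spaces[name] = (list(values), randomized)
        return self

    def get_grid(self):
        # Exhaustive search skips spaces reserved for randomized search.
        return {
            name: values
            for name, (values, randomized) in self._spaces.items()
            if not randomized
        }


est = (
    Toy()
    .set_grid("alpha", [0.1, 1, 10])
    .set_grid("l1_ratio", [0.2, 0.4, 0.6, 0.8], randomized=True)
)
```

With the keyword-argument design proposed in this SLEP, there would be no place to attach such per-parameter metadata without a further method.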
|
||
Discussion | ||
---------- | ||
|
||
There are several areas to extend upon the above changes, such as to allow | ||
easier construction of pipelines with alternative steps to be searched (see | ||
``searchgrid.make_pipeline``), and handling alternative steps having | ||
non-uniform distribution for randomised search. | ||
|
||
There are also several other limitations of the proposed solution: | ||
|
||
Limitation: tied parameters | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Our solution does not directly meet the need for conditional | ||
dependencies within a single estimator raised in :issue:`9610`, e.g.:: | ||
|
||
param_grid = [ | ||
    { | ||
        "kernel": ["rbf"], | ||
        "gamma": [.001, .0001], | ||
        "C": [1, 10], | ||
    }, | ||
    { | ||
        "kernel": ["linear"], | ||
        "C": [1, 10], | ||
    } | ||
] | ||
|
||
Using the proposed API, the user would need to search over multiple instances | ||
of the estimator, each carrying the parameter grid that applies to its own | ||
fixed settings, so that the grids are searched independently of one another. | ||
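
The effect of searching over multiple instances can be illustrated by expanding the grid that such a search would correspond to. In this sketch the wrapping step name ``clf`` is an assumption, strings stand in for the two estimator instances, and ``expand`` is a small helper written for the example.

```python
from itertools import product

# One dict per estimator instance; each instance fixes its own kernel and
# carries only the parameters that are meaningful for that kernel.
param_grid = [
    {
        "clf": ["<SVC(kernel='rbf') instance>"],
        "clf__gamma": [0.001, 0.0001],
        "clf__C": [1, 10],
    },
    {
        "clf": ["<SVC(kernel='linear') instance>"],
        "clf__C": [1, 10],
    },
]


def expand(grid):
    """Yield every candidate setting described by a list of dicts."""
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


candidates = list(expand(param_grid))
```

The expansion contains four candidates for the RBF instance and two for the linear instance, and ``gamma`` never appears alongside the linear kernel.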
|
||
That issue also raises a request to tie parameters across estimators. While | ||
the current proposal does not support this use case, the algorithm translating | ||
an estimator to its deep parameter grid/distribution could potentially be adjusted | ||
to recognise a ``TiedParam`` helper. | ||
|
||
Limitation: continued use of ``__`` for search parameters | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
While this proposal reduces the use of dunders (``__``) for specifying parameter | ||
spaces, they will still be rendered in ``*SearchCV``'s ``cv_results_`` attribute. | ||
``cv_results_`` is similarly affected by large changes to its keys when small | ||
changes are made to the composite model structure. Future work could provide | ||
tools to make ``cv_results_`` more accessible and invariant to model structure. | ||
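
One possible shape for such a tool is a helper that maps each ``param_*`` key of ``cv_results_`` to a shorter label. The sketch below is only an illustration of the idea: ``short_param_names`` is a hypothetical function, and the ``cv_results`` dict uses plain lists in place of the arrays that a real search would store.

```python
def short_param_names(cv_results):
    """Map each searched parameter path to its last ``__`` component.

    The full path is kept wherever the short name would be ambiguous.
    """
    prefix = "param_"
    paths = [key[len(prefix):] for key in cv_results if key.startswith(prefix)]
    leaves = [path.rsplit("__", 1)[-1] for path in paths]
    return {
        path: leaf if leaves.count(leaf) == 1 else path
        for path, leaf in zip(paths, leaves)
    }


cv_results = {
    "param_reduce_dim__k": [2, 4],
    "param_classify__C": [1, 1],
    "params": [{}, {}],
    "mean_test_score": [0.8, 0.9],
}
names = short_param_names(cv_results)
```

Labels produced this way would survive the renaming of a Pipeline step, although they are still not invariant to every change in model structure.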
Comment on lines +354 to +356:
since this new way of providing the grid requires explicit act of the user via changing the value of |
||
|
||
References and Footnotes | ||
------------------------ | ||
|
||
.. [1] Joel Nothman (2017). *SearchGrid*. Software Release. | ||
https://searchgrid.readthedocs.io/ | ||
|
||
.. [2] Guillaume Chevalier, Alexandre Brilliant and Eric Hamel (2019). | ||
*Neuraxle - A Python Framework for Neat Machine Learning Pipelines*. | ||
DOI:10.13140/RG.2.2.33135.59043. Software at https://www.neuraxle.org/ | ||
|
||
Copyright | ||
--------- | ||
|
||
This document has been placed in the public domain. |
we're adding here 4 methods, aren't we? But I'm not sure if we want to introduce them here or after the example below. I'm okay generally with the text as is. Maybe we just want here to say we add 4 methods and let the details be left for later.