From c619fdc7cf4ef3a2feae3b6806c2bce1f00db261 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 30 Nov 2021 22:06:36 +1100 Subject: [PATCH 01/13] Partial draft of SLEP014: parameter spaces on estimators --- slep014/proposal.rst | 200 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 200 insertions(+) create mode 100644 slep014/proposal.rst diff --git a/slep014/proposal.rst b/slep014/proposal.rst new file mode 100644 index 0000000..4a60962 --- /dev/null +++ b/slep014/proposal.rst @@ -0,0 +1,200 @@ +.. _slep_014: + +======================================= +SLEP014: parameter spaces on estimators +======================================= + +:Author: Joel Nothman +:Status: Draft +:Type: Standards Track +:Created: 2021-11-30 + +Abstract +-------- + +This proposes to simplify the specification of parameter searches by allowing +the user to store candidate values for each parameter on each estimator. +The ``*SearchCV`` estimators would then having a setting to extract the +parameter grid or distribution from a traversal of the supplied estimator. + +Detailed description +-------------------- + +The ability to set and get parameters from deep within nested estimators using +``get_params`` and ``set_params`` is powerful, but the specification of +parameter spaces to search can be very unfriendly for users. +In particular, the structure of the parameter grid specification needs to +reflect the structure of the estimator, with every path explicitly notated with +``__``-separated elements. + +For example, `one example `__ +proposes searching over alternative preprocessing steps in a Pipeline and their +parameters, as well as the parameters of the downstream classifier. + +:: + + from sklearn.pipeline import Pipeline + from sklearn.svm import LinearSVC + from sklearn.decomposition import PCA, NMF + from sklearn.feature_selection import SelectKBest, chi2 + + pipe = Pipeline( + [ + # the reduce_dim stage is populated by the param_grid + ("reduce_dim", "passthrough"), + ("classify", LinearSVC(dual=False, max_iter=10000)), + ] + ) + + N_FEATURES_OPTIONS = [2, 4, 8] + C_OPTIONS = [1, 10, 100, 1000] + param_grid = [ + { + "reduce_dim": [PCA(iterated_power=7)], + "reduce_dim__n_components": N_FEATURES_OPTIONS, + "classify__C": C_OPTIONS, + }, + { + "reduce_dim": [SelectKBest(chi2)], + "reduce_dim__k": N_FEATURES_OPTIONS, + "classify__C": C_OPTIONS, + }, + ] + +Here we see that in order to specify the search space for the 'k' parameter of +``SelectKBest``, the user needs to identify its fully qualified path from the +root estimator (``pipe``) that will be passed to the grid search estimator, +i.e. ``reduce_dim__k``. To construct this fully qualified parameter name, the +user must know that the ``SelectKBest`` estimator resides in a ``Pipeline`` +step named ``reduce_dim`` and that the Pipeline is not further nested in +another estimator. Changing the name of ``reduce_dim`` would entail a change to +5 lines in the above code snippet. + +We also see that the options for ``classify__C`` need to be specified twice. +Were the candidate values for ``C`` something that belonged to the +``LinearSVC`` estimator instance, rather than part of a grid specification, it +would be possible to specify it only once. 
The use of a list of two separate +dicts of parameter spaces is altogether avoidable, where the only reason the +parameter space is duplicated is to handle the alternation of one step in the +pipeline; for all other steps, it makes sense for the candidate parameter +space to remain constant regardless of whether ``reduce_dim`` is a feature +selector or a PCA. + +Here we propose to allow the user to specify candidates or distributions for +local parameters on a specific estimator estimator instance:: + + svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) + pipe = Pipeline( + [ + # the reduce_dim stage is populated by the param_grid + ("reduce_dim", "passthrough"), + ("classify", svc), + ] + ).set_grid(reduce_dim=[ + PCA(iterated_power=7).set_grid(n_components=N_FEATURES_OPTIONS), + SelectKBest().set_grid(k=N_FEATURES_OPTIONS), + ]) + +With this use of ``set_grid``, ``GridSearchCV(pipe)`` would not need the +parameter grid to be specified explicitly. Instead, a recursive descent through +``pipe``'s parameters allows it to reconstruct exactly the grid used in the +example above. + +Such functionality therefore allows users to: + +* easily define a parameter space together with the estimator they relate to, + improving code cohesion. +* establish a library of estimator configurations for reuse, reducing repeated + code, and reducing setup costs for auto-ML approaches. +* avoid work modifying the parameter space specification when a composite + estimator's strucutre is changed, or a Pipeline step is renamed. +* more comfortably specify search spaces that include the alternation of a + step in a Pipeline (or ColumnTransformer, etc.), creating a form of + conditional dependency in the search space. + +Implementation +-------------- + +TODO + +setter, getter for grid. +setter, getter for distribution. +Overwriting behaviour + +Private attribute on estimator, dynamically allocated on request + +Grid Search update to handle param_grid='extract' using the algorithm +and implementation from searchgrid [1]_. If an empty grid is extracted +Randomized Search update to handle param_distributions='extract', using ``get_grid`` +only to update the results of ``get_distribution``. + +Parameter spaces should be copied in clone, so that a user can overwrite only +one parameter's space without redefining everything. + +Backward compatibility +---------------------- + +No concerns + +Alternatives +------------ + +TODO + +no methhods, but storing on est +GridFactory (:issue:`21784`) + +Alternative syntaxes + +one call per param? +duplicate vs single method for grid vs distbn + +make_pipeline alternative or extension to avoid declaring 'passthrough' + +searchgrid [1]_, Neuraxle [2]_ + +Discussion +---------- + +raised in :issue:`19045`. + +:issue:`9610`: our solution does not directly meet the need for conditional +dependencies within a single estimator, e.g:: + + param_grid = [ + { + "kernel": ["rbf"], + "gamma": [.001, .0001], + "C": [1, 10], + }, + { + "kernel": ["linear"], + "C": [1, 10], + } + ] + +searchgrid's implementation was mentioned in relation to +https://github.com/scikit-learn/scikit-learn/issues/7707#issuecomment-392298478 + +Not handled: ``__`` paths still used in ``cv_results_`` + +This section may just be a bullet list including links to any discussions +regarding the SLEP: + +- This includes links to mailing list threads or relevant GitHub issues. + + +References and Footnotes +------------------------ + +.. [1] Joel Nothman (2017). *SearchGrid*. Software Release. 
+ https://searchgrid.readthedocs.io/ + +.. [2] Guillaume Chevalier, Alexandre Brilliant and Eric Hamel (2019). + *Neuraxle - A Python Framework for Neat Machine Learning Pipelines*. + DOI:10.13140/RG.2.2.33135.59043. Software at https://www.neuraxle.org/ + +Copyright +--------- + +This document has been placed in the public domain. From 79b67ff0b0ef564a2557d3140ce91c050702aa5c Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 30 Nov 2021 22:08:46 +1100 Subject: [PATCH 02/13] typos --- slep014/proposal.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/slep014/proposal.rst b/slep014/proposal.rst index 4a60962..22b61cd 100644 --- a/slep014/proposal.rst +++ b/slep014/proposal.rst @@ -1,7 +1,7 @@ .. _slep_014: ======================================= -SLEP014: parameter spaces on estimators +SLEP014: Parameter Spaces on Estimators ======================================= :Author: Joel Nothman @@ -14,7 +14,7 @@ Abstract This proposes to simplify the specification of parameter searches by allowing the user to store candidate values for each parameter on each estimator. -The ``*SearchCV`` estimators would then having a setting to extract the +The ``*SearchCV`` estimators would then have a setting to construct the parameter grid or distribution from a traversal of the supplied estimator. Detailed description From 85e064bc8a87e3b136e22b2eec120469a855945c Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 1 Dec 2021 13:46:35 +1100 Subject: [PATCH 03/13] Apply suggestions from code review Co-authored-by: Andreas Mueller --- slep014/proposal.rst | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/slep014/proposal.rst b/slep014/proposal.rst index 22b61cd..31e2602 100644 --- a/slep014/proposal.rst +++ b/slep014/proposal.rst @@ -13,9 +13,9 @@ Abstract -------- This proposes to simplify the specification of parameter searches by allowing -the user to store candidate values for each parameter on each estimator. +the user to store candidate values for each parameter on the corresponding estimator. The ``*SearchCV`` estimators would then have a setting to construct the -parameter grid or distribution from a traversal of the supplied estimator. +parameter grid or distribution from the supplied estimator. Detailed description -------------------- @@ -24,7 +24,7 @@ The ability to set and get parameters from deep within nested estimators using ``get_params`` and ``set_params`` is powerful, but the specification of parameter spaces to search can be very unfriendly for users. In particular, the structure of the parameter grid specification needs to -reflect the structure of the estimator, with every path explicitly notated with +reflect the structure of the estimator, with every path explicitly specified by ``__``-separated elements. For example, `one example `__ @@ -74,13 +74,13 @@ We also see that the options for ``classify__C`` need to be specified twice. Were the candidate values for ``C`` something that belonged to the ``LinearSVC`` estimator instance, rather than part of a grid specification, it would be possible to specify it only once. 
The use of a list of two separate -dicts of parameter spaces is altogether avoidable, where the only reason the +dicts of parameter spaces is altogether avoidable, as the only reason the parameter space is duplicated is to handle the alternation of one step in the pipeline; for all other steps, it makes sense for the candidate parameter space to remain constant regardless of whether ``reduce_dim`` is a feature selector or a PCA. -Here we propose to allow the user to specify candidates or distributions for +This SLEP proposes to allow the user to specify candidates or distributions for local parameters on a specific estimator estimator instance:: svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) @@ -105,7 +105,8 @@ Such functionality therefore allows users to: * easily define a parameter space together with the estimator they relate to, improving code cohesion. * establish a library of estimator configurations for reuse, reducing repeated - code, and reducing setup costs for auto-ML approaches. + code, and reducing setup costs for auto-ML approaches. As such, this change + helps to enable :issue:`5004`. * avoid work modifying the parameter space specification when a composite estimator's strucutre is changed, or a Pipeline step is renamed. * more comfortably specify search spaces that include the alternation of a @@ -125,6 +126,7 @@ Private attribute on estimator, dynamically allocated on request Grid Search update to handle param_grid='extract' using the algorithm and implementation from searchgrid [1]_. If an empty grid is extracted +an error should be raised. Randomized Search update to handle param_distributions='extract', using ``get_grid`` only to update the results of ``get_distribution``. From 046f1089c6fba2b6a78042ad30e3537812845478 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 1 Dec 2021 13:47:54 +1100 Subject: [PATCH 04/13] Add some discussion points from Andy's code review --- slep014/proposal.rst | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/slep014/proposal.rst b/slep014/proposal.rst index 31e2602..1125a89 100644 --- a/slep014/proposal.rst +++ b/slep014/proposal.rst @@ -13,9 +13,9 @@ Abstract -------- This proposes to simplify the specification of parameter searches by allowing -the user to store candidate values for each parameter on the corresponding estimator. +the user to store candidate values for each parameter on each estimator. The ``*SearchCV`` estimators would then have a setting to construct the -parameter grid or distribution from the supplied estimator. +parameter grid or distribution from a traversal of the supplied estimator. Detailed description -------------------- @@ -24,7 +24,7 @@ The ability to set and get parameters from deep within nested estimators using ``get_params`` and ``set_params`` is powerful, but the specification of parameter spaces to search can be very unfriendly for users. In particular, the structure of the parameter grid specification needs to -reflect the structure of the estimator, with every path explicitly specified by +reflect the structure of the estimator, with every path explicitly notated with ``__``-separated elements. For example, `one example `__ @@ -74,13 +74,13 @@ We also see that the options for ``classify__C`` need to be specified twice. Were the candidate values for ``C`` something that belonged to the ``LinearSVC`` estimator instance, rather than part of a grid specification, it would be possible to specify it only once. 
The use of a list of two separate -dicts of parameter spaces is altogether avoidable, as the only reason the +dicts of parameter spaces is altogether avoidable, where the only reason the parameter space is duplicated is to handle the alternation of one step in the pipeline; for all other steps, it makes sense for the candidate parameter space to remain constant regardless of whether ``reduce_dim`` is a feature selector or a PCA. -This SLEP proposes to allow the user to specify candidates or distributions for +Here we propose to allow the user to specify candidates or distributions for local parameters on a specific estimator estimator instance:: svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) @@ -105,8 +105,7 @@ Such functionality therefore allows users to: * easily define a parameter space together with the estimator they relate to, improving code cohesion. * establish a library of estimator configurations for reuse, reducing repeated - code, and reducing setup costs for auto-ML approaches. As such, this change - helps to enable :issue:`5004`. + code, and reducing setup costs for auto-ML approaches. * avoid work modifying the parameter space specification when a composite estimator's strucutre is changed, or a Pipeline step is renamed. * more comfortably specify search spaces that include the alternation of a @@ -126,13 +125,14 @@ Private attribute on estimator, dynamically allocated on request Grid Search update to handle param_grid='extract' using the algorithm and implementation from searchgrid [1]_. If an empty grid is extracted -an error should be raised. Randomized Search update to handle param_distributions='extract', using ``get_grid`` only to update the results of ``get_distribution``. Parameter spaces should be copied in clone, so that a user can overwrite only one parameter's space without redefining everything. +expected behaviour when a parameter name with `__` is used. + Backward compatibility ---------------------- @@ -174,12 +174,19 @@ dependencies within a single estimator, e.g:: "C": [1, 10], } ] + +That issue also raises a request to tie parameters across estimators. While +the current proposal does not support this use case, the algorithm translating +an estimator to its deep parameter grid/distribution could potentially be adjusted +to recognise a ``TiedParam`` helper. searchgrid's implementation was mentioned in relation to https://github.com/scikit-learn/scikit-learn/issues/7707#issuecomment-392298478 Not handled: ``__`` paths still used in ``cv_results_`` +Non-uniform distributions on categorical values. + This section may just be a bullet list including links to any discussions regarding the SLEP: From 239ff8a196358859152b59254bdf04304ad0f57b Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 1 Dec 2021 13:49:35 +1100 Subject: [PATCH 05/13] Reintroduce edits after poor merge --- slep014/proposal.rst | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/slep014/proposal.rst b/slep014/proposal.rst index 1125a89..a95ce98 100644 --- a/slep014/proposal.rst +++ b/slep014/proposal.rst @@ -13,9 +13,9 @@ Abstract -------- This proposes to simplify the specification of parameter searches by allowing -the user to store candidate values for each parameter on each estimator. +the user to store candidate values for each parameter on the corresponding estimator. The ``*SearchCV`` estimators would then have a setting to construct the -parameter grid or distribution from a traversal of the supplied estimator. 
+parameter grid or distribution from the supplied estimator. Detailed description -------------------- @@ -24,7 +24,7 @@ The ability to set and get parameters from deep within nested estimators using ``get_params`` and ``set_params`` is powerful, but the specification of parameter spaces to search can be very unfriendly for users. In particular, the structure of the parameter grid specification needs to -reflect the structure of the estimator, with every path explicitly notated with +reflect the structure of the estimator, with every path explicitly specified by ``__``-separated elements. For example, `one example `__ @@ -74,13 +74,13 @@ We also see that the options for ``classify__C`` need to be specified twice. Were the candidate values for ``C`` something that belonged to the ``LinearSVC`` estimator instance, rather than part of a grid specification, it would be possible to specify it only once. The use of a list of two separate -dicts of parameter spaces is altogether avoidable, where the only reason the +dicts of parameter spaces is altogether avoidable, as the only reason the parameter space is duplicated is to handle the alternation of one step in the pipeline; for all other steps, it makes sense for the candidate parameter space to remain constant regardless of whether ``reduce_dim`` is a feature selector or a PCA. -Here we propose to allow the user to specify candidates or distributions for +This SLEP proposes to allow the user to specify candidates or distributions for local parameters on a specific estimator estimator instance:: svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) @@ -105,7 +105,8 @@ Such functionality therefore allows users to: * easily define a parameter space together with the estimator they relate to, improving code cohesion. * establish a library of estimator configurations for reuse, reducing repeated - code, and reducing setup costs for auto-ML approaches. + code, and reducing setup costs for auto-ML approaches. As such, this change + helps to enable :issue:`5004`. * avoid work modifying the parameter space specification when a composite estimator's strucutre is changed, or a Pipeline step is renamed. * more comfortably specify search spaces that include the alternation of a @@ -125,6 +126,7 @@ Private attribute on estimator, dynamically allocated on request Grid Search update to handle param_grid='extract' using the algorithm and implementation from searchgrid [1]_. If an empty grid is extracted +an error should be raised. Randomized Search update to handle param_distributions='extract', using ``get_grid`` only to update the results of ``get_distribution``. From 85419509233d6306802d44359a2f8035e25a47af Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 1 Dec 2021 14:37:02 +1100 Subject: [PATCH 06/13] Add docstring for set_grid --- slep014/proposal.rst | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/slep014/proposal.rst b/slep014/proposal.rst index a95ce98..1067122 100644 --- a/slep014/proposal.rst +++ b/slep014/proposal.rst @@ -118,6 +118,36 @@ Implementation TODO +Four public methods will be added to ``BaseEstimator``:: + + def set_grid(self, **grid: List[object]): + """Sets candidate values for parameters in a search + + These candidates are used in grid search when a paameter grid is not + explicitly specified. They are also used in randomized search in the + case where set_distribution has not been used for the corresponding + parameter. 
+ + As with :meth:`set_params`, update semantics apply, such that + ``set_grid(param1=['a', 'b'], param2=[1, 2]).set_grid(param=['a'])`` + will retain the candidates set for ``param2``. To reset the grid, + each parameter's candidates should be set to ``[]``. + + Parameters + ---------- + grid : Dict[Str, List[object]] + Keyword arguments define the values to be searched for each + specified parameter. + + Keywords must be valid parameter names from :meth:`get_params`. + + Returns + ------- + self : Estimator + """ + ... + + setter, getter for grid. setter, getter for distribution. Overwriting behaviour From 05be687ed5b7832317209832fbd8bf771d690ad1 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 4 Jan 2022 00:41:25 +1100 Subject: [PATCH 07/13] Attempt to complete Implementation section --- slep014/proposal.rst | 108 +++++++++++++++++++++++++++++++++++++------ 1 file changed, 94 insertions(+), 14 deletions(-) diff --git a/slep014/proposal.rst b/slep014/proposal.rst index 1067122..c841888 100644 --- a/slep014/proposal.rst +++ b/slep014/proposal.rst @@ -116,14 +116,12 @@ Such functionality therefore allows users to: Implementation -------------- -TODO - Four public methods will be added to ``BaseEstimator``:: def set_grid(self, **grid: List[object]): """Sets candidate values for parameters in a search - These candidates are used in grid search when a paameter grid is not + These candidates are used in grid search when a parameter grid is not explicitly specified. They are also used in randomized search in the case where set_distribution has not been used for the corresponding parameter. @@ -147,23 +145,103 @@ Four public methods will be added to ``BaseEstimator``:: """ ... + def get_grid(self): + """Retrieves current settings for parameters where search candidates are set + + Note that this only reflects local parameter candidates, and a grid + including nested estimators can be constructed in combination with + `get_params`. + + Returns + ------- + dict + A mapping from parameter name to a list of values. Each parameter + name should be a member of `self.get_params(deep=False).keys()`. + """ + ... + + def set_distribution(self, **distribution): + """Sets candidate values for parameters in a search + + These candidates are used in randomized search when a parameter + distribution is not explicitly specified. For parameters where + no distribution is defined and a grid is defined, those grid values + will also be used. + + As with :meth:`set_params`, update semantics apply, such that + ``set_distribution(param1=['a', 'b'], param2=[1, 2]).set_grid(param=['a'])`` + will retain the candidates set for ``param2``. To reset the grid, + each parameter's candidates should be set to ``[]``. + + Parameters + ---------- + distribution : mapping from str to RV or list + Keyword arguments define the distribution to be searched for each + specified parameter. + Distributions may be specified either as an object with the method + ``rvs`` (see :mod:`scipy.stats`) or a list of discrete values with + uniform distribution. + + Keywords must be valid parameter names from :meth:`get_params`. + + Returns + ------- + self : Estimator + """ + ... + + def get_distribution(self): + """Retrieves current settings for parameters where a search distribution is set + + Note that this only reflects local parameter candidates, and a joint distribution + including nested estimators can be constructed in combination with + `get_params`. -setter, getter for grid. -setter, getter for distribution. 
-Overwriting behaviour + For parameters where ``set_distribution`` has not been used, but ``set_grid`` + has been, this will return the corresponding list of values specified in + ``set_grid``. -Private attribute on estimator, dynamically allocated on request + Returns + ------- + dict + A mapping from parameter name to a scipy-compatible distribution + (i.e. with ``rvs``` method) or list of discrete values. Each parameter + name should be a member of `self.get_params(deep=False).keys()`. + """ + ... -Grid Search update to handle param_grid='extract' using the algorithm -and implementation from searchgrid [1]_. If an empty grid is extracted -an error should be raised. -Randomized Search update to handle param_distributions='extract', using ``get_grid`` -only to update the results of ``get_distribution``. +The current distribution and grid values will be stored in a private +attribute on the estimator, and ``get_grid`` may simply return this value, +or an empty dict if undefined, while ``get_distribution`` will combine the +stored attribute with ``get_grid`` values. +The attribute will be undefined by default upon construction of the estimator. Parameter spaces should be copied in clone, so that a user can overwrite only one parameter's space without redefining everything. - -expected behaviour when a parameter name with `__` is used. +To facilitate this (in the absence of a polymorphic implementation of clone), +we might need to store the candidate grids and distributions in a known instance +attribute, or use a combination of `get_grid`, `get_distribution`, `get_params` +and `set_grid`, `set_distribution` etc. to perform `clone`. + +Search estimators in `sklearn.model_selection` will be updated such that the +required `param_grid` and `param_distributions` parameters will now default +to 'extract'. The 'extract' value instructs the search estimator to construct +a complete search space from the provided estimator's `get_grid` (or +`get_distribution`) return value together with `get_params`. +It recursively calls `get_grid` (and `get_distribution`) on any parametrized +objects (i.e. those with `get_params`) with this method that are descendent +from the given estimator, including: +* values in ``estimator.get_params(deep=True)`` +* elements of list values in ``x.get_grid()`` or ``x.get_distribution()`` + as appropriate (disregarding rvs) for any `x` descendant of the estimator. + +See the implementation of ``build_param_grid`` in Searchgrid [1]_, which applies +to the grid search case. This algorithm enables the specification of searches +over components in a pipeline as well as their parameters. + +Where the search estimator perfoming the 'extract' algorithm extracts an empty +grid or distribution altogether for the given estimator, it should raise a +``ValueError``, indicative of likely user error. Backward compatibility ---------------------- @@ -190,6 +268,8 @@ searchgrid [1]_, Neuraxle [2]_ Discussion ---------- +TODO + raised in :issue:`19045`. 
:issue:`9610`: our solution does not directly meet the need for conditional From f39e095453082d1ed4eb611fdb5113931c1c7eed Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 2 Feb 2022 15:05:02 +1100 Subject: [PATCH 08/13] Correct SLEP number to 016 --- {slep014 => slep016}/proposal.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) rename {slep014 => slep016}/proposal.rst (99%) diff --git a/slep014/proposal.rst b/slep016/proposal.rst similarity index 99% rename from slep014/proposal.rst rename to slep016/proposal.rst index c841888..e5aa9b5 100644 --- a/slep014/proposal.rst +++ b/slep016/proposal.rst @@ -1,7 +1,7 @@ -.. _slep_014: +.. _slep_016: ======================================= -SLEP014: Parameter Spaces on Estimators +SLEP016: Parameter Spaces on Estimators ======================================= :Author: Joel Nothman From e6c61c4b4d8184a83131e929f35d3fccc11d179a Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 18 Mar 2022 11:52:18 +1100 Subject: [PATCH 09/13] Complete draft of the SLEP --- slep016/proposal.rst | 107 ++++++++++++++++++++++++++++--------------- 1 file changed, 70 insertions(+), 37 deletions(-) diff --git a/slep016/proposal.rst b/slep016/proposal.rst index e5aa9b5..20104d9 100644 --- a/slep016/proposal.rst +++ b/slep016/proposal.rst @@ -67,21 +67,20 @@ root estimator (``pipe``) that will be passed to the grid search estimator, i.e. ``reduce_dim__k``. To construct this fully qualified parameter name, the user must know that the ``SelectKBest`` estimator resides in a ``Pipeline`` step named ``reduce_dim`` and that the Pipeline is not further nested in -another estimator. Changing the name of ``reduce_dim`` would entail a change to -5 lines in the above code snippet. +another estimator. Changing the step identifier ``reduce_dim`` would entail +a change to 5 lines in the above code snippet. We also see that the options for ``classify__C`` need to be specified twice. -Were the candidate values for ``C`` something that belonged to the -``LinearSVC`` estimator instance, rather than part of a grid specification, it -would be possible to specify it only once. The use of a list of two separate -dicts of parameter spaces is altogether avoidable, as the only reason the +It should be possible to specify it only once. The use of a list of two separate +dicts of parameter spaces is similarly cumbersome: the only reason the parameter space is duplicated is to handle the alternation of one step in the pipeline; for all other steps, it makes sense for the candidate parameter space to remain constant regardless of whether ``reduce_dim`` is a feature selector or a PCA. -This SLEP proposes to allow the user to specify candidates or distributions for -local parameters on a specific estimator estimator instance:: +This SLEP proposes to add a methods to estimators that allow the user +to specify candidates or distributions for local parameters on a specific +estimator estimator instance:: svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) pipe = Pipeline( @@ -113,6 +112,18 @@ Such functionality therefore allows users to: step in a Pipeline (or ColumnTransformer, etc.), creating a form of conditional dependency in the search space. +History +------- + +:issue:`5082`, :issue:`7608` and :issue:`19045` have all raised associating +parameter search spaces directly with an estimator instance, while this +has been supported by third party packages [1]_, [2]_. 
:issue:`21784` proposed +a ``GridFactory``, but feedback suggested that methods on each estimator +was more usable than an external utility. + +This proposal pertains to the Scikit-learn Roadmap entry "Better support for +manual and automatic pipeline building" dating back to 2018. + Implementation -------------- @@ -213,20 +224,20 @@ Four public methods will be added to ``BaseEstimator``:: The current distribution and grid values will be stored in a private attribute on the estimator, and ``get_grid`` may simply return this value, or an empty dict if undefined, while ``get_distribution`` will combine the -stored attribute with ``get_grid`` values. +stored parameter distributions with ``get_grid`` values. The attribute will be undefined by default upon construction of the estimator. -Parameter spaces should be copied in clone, so that a user can overwrite only -one parameter's space without redefining everything. +Parameter spaces should be copied in :ojb:`sklearn.base.clone`, so that a user +can overwrite only one parameter's space without redefining everything. To facilitate this (in the absence of a polymorphic implementation of clone), we might need to store the candidate grids and distributions in a known instance attribute, or use a combination of `get_grid`, `get_distribution`, `get_params` and `set_grid`, `set_distribution` etc. to perform `clone`. Search estimators in `sklearn.model_selection` will be updated such that the -required `param_grid` and `param_distributions` parameters will now default +currently required `param_grid` and `param_distributions` parameters will now default to 'extract'. The 'extract' value instructs the search estimator to construct -a complete search space from the provided estimator's `get_grid` (or +a complete search space from the provided estimator's `get_grid` (respectively, `get_distribution`) return value together with `get_params`. It recursively calls `get_grid` (and `get_distribution`) on any parametrized objects (i.e. those with `get_params`) with this method that are descendent @@ -251,29 +262,51 @@ No concerns Alternatives ------------ -TODO +The fundamental change here is to associate parameter search configuration with each atomic estimator object. -no methhods, but storing on est -GridFactory (:issue:`21784`) +Alternative APIs to do so include: -Alternative syntaxes +* Provide a function ``set_grid`` as Searchgrid [1]_ does, which takes an + estimator instance and a parameter space, and sets a private + attribute on the estimator object. This avoids cluttering the estimator's + method namespace. +* Provide a `GridFactory` (see :issue:`21784`) which allows the user to + construct a mapping from atomic estimator instances to their search spaces. + Aside from not cluttering the estimator's namespace, this may have + theoretical benefit in allowing the user to construct multiple search spaces + for the same composite estimator. There are no known use cases for this + benefit. -one call per param? -duplicate vs single method for grid vs distbn +Another questionable design is the separation of ``set_grid`` and ``set_distribution``. +These could be combined into a single method (``set_search_space``?), such that +:class:`~sklearn.model_selection.GridSearchCV` rejects a call to `fit` where `rvs` +appear. This would make it harder to predefine search spaces that could be used +for either exhaustive or randomised searches, which may be a use case in Auto-ML. 
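+
+For illustration, a minimal sketch of the reuse that keeping the two methods
+separate would permit, assuming the ``set_grid``/``set_distribution`` methods
+and the 'extract' defaults proposed above (the estimator and candidate values
+are arbitrary examples)::
+
+    from scipy.stats import loguniform
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
+
+    # One predefined configuration can serve both kinds of search.
+    est = LogisticRegression()
+    est.set_grid(C=[0.01, 0.1, 1, 10])
+    est.set_distribution(C=loguniform(1e-3, 1e3))
+
+    # param_grid and param_distributions default to 'extract', so each search
+    # space is read from the estimator itself.
+    grid_search = GridSearchCV(est)
+    rand_search = RandomizedSearchCV(est)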
-make_pipeline alternative or extension to avoid declaring 'passthrough' +The design of using a keyword argument for each parameter in ``set_grid`` +encourages succinctness but reduces extensibility. +For example, we could design the API to require a single call per parameter:: -searchgrid [1]_, Neuraxle [2]_ + est.set_grid('alpha', [.1, 1, 10]).set_grid('l1_ratio', [.2, .4, .6., .8]) + +This design would allow further parameters to be added to `set_grid` to enrich +the use of this data, including whether or not it is intended for randomised search. Discussion ---------- -TODO +There are several areas to extend upon the above changes, such as to allow +easier construction of pipelines with alternative steps to be searched (see +``searchgrid.make_pipeline``), and handling alternative steps having +non-uniform distribution for randomised search. + +There are also several other limitatins of the proposed solution: -raised in :issue:`19045`. +Limitation: tied parameters +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -:issue:`9610`: our solution does not directly meet the need for conditional -dependencies within a single estimator, e.g:: +Our solution does not directly meet the need for conditional +dependencies within a single estimator raised in :issue:`9610`, e.g:: param_grid = [ { @@ -286,24 +319,24 @@ dependencies within a single estimator, e.g:: "C": [1, 10], } ] - + +Using the proposed API, the user would need to search over multiple instances +of the estimator, setting the parameter grids that could be searched with +conditional independence. + That issue also raises a request to tie parameters across estimators. While the current proposal does not support this use case, the algorithm translating an estimator to its deep parameter grid/distribution could potentially be adjusted to recognise a ``TiedParam`` helper. -searchgrid's implementation was mentioned in relation to -https://github.com/scikit-learn/scikit-learn/issues/7707#issuecomment-392298478 - -Not handled: ``__`` paths still used in ``cv_results_`` - -Non-uniform distributions on categorical values. - -This section may just be a bullet list including links to any discussions -regarding the SLEP: - -- This includes links to mailing list threads or relevant GitHub issues. +Limitation: continued use of ``__`` for search parameters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +While this proposal reduces the use of dunders (``__``) for specifying parameter +spaces, they will still be rendered in ``*SearchCV``'s ``cv_results_`` attribute. +``cv_results_`` is similarly affected by large changes to its keys when small +changes are made to the composite model structure. Future work could provide +tools to make ``cv_results_`` more accessible and invariant to model structure. References and Footnotes ------------------------ From 45adda4aadb4da7e7d325d94259c3a7e6fe74928 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 29 Dec 2023 11:59:01 +1100 Subject: [PATCH 10/13] Comment on aidss approach --- slep016/proposal.rst | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/slep016/proposal.rst b/slep016/proposal.rst index 20104d9..e2be64d 100644 --- a/slep016/proposal.rst +++ b/slep016/proposal.rst @@ -271,11 +271,24 @@ Alternative APIs to do so include: attribute on the estimator object. This avoids cluttering the estimator's method namespace. * Provide a `GridFactory` (see :issue:`21784`) which allows the user to - construct a mapping from atomic estimator instances to their search spaces. 
+ construct a mapping from atomic estimator instances (and potentially estimator + classes as a fallback) to their search spaces. Aside from not cluttering the estimator's namespace, this may have theoretical benefit in allowing the user to construct multiple search spaces for the same composite estimator. There are no known use cases for this - benefit. + benefit. This approach cannot retain the parameter space for a cloned estimator, + potentially leading to surprising behavior. +* In the vein of `GridFactory`, but without a new object-oriented API: + Provide a helper function which takes a mapping of estimator instances + (and perhaps classes as a fall-back) to a shallow parameter search space, and + transforms it into a traditional parameter grid. + This helper function could be public, or else this instance-space mapping would + become a new, *additional* way of specifying a parameter grid to `*SearchCV`. + Inputs in this format would automatically be converted to traditional parameter + grids. This has similar benefits and downsides as `GridFactory`, while avoiding + introducing a new API and instead relying on plain old Python dicts. + Having multiple distinct dict-based representations of parameter spaces is + likely to confuse users. Another questionable design is the separation of ``set_grid`` and ``set_distribution``. These could be combined into a single method (``set_search_space``?), such that From 748fa31a162820990112001b9a3f15d73f779b73 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 29 Dec 2023 12:22:28 +1100 Subject: [PATCH 11/13] address review comments --- slep016/proposal.rst | 53 ++++++++++++++++++++++---------------------- 1 file changed, 26 insertions(+), 27 deletions(-) diff --git a/slep016/proposal.rst b/slep016/proposal.rst index e2be64d..c503728 100644 --- a/slep016/proposal.rst +++ b/slep016/proposal.rst @@ -82,19 +82,19 @@ This SLEP proposes to add a methods to estimators that allow the user to specify candidates or distributions for local parameters on a specific estimator estimator instance:: - svc = LinearSVC(dual=False, max_iter=10000).set_grid(C=C_OPTIONS) + svc = LinearSVC(dual=False, max_iter=10000).set_search_grid(C=C_OPTIONS) pipe = Pipeline( [ # the reduce_dim stage is populated by the param_grid ("reduce_dim", "passthrough"), ("classify", svc), ] - ).set_grid(reduce_dim=[ - PCA(iterated_power=7).set_grid(n_components=N_FEATURES_OPTIONS), - SelectKBest().set_grid(k=N_FEATURES_OPTIONS), + ).set_search_grid(reduce_dim=[ + PCA(iterated_power=7).set_search_grid(n_components=N_FEATURES_OPTIONS), + SelectKBest().set_search_grid(k=N_FEATURES_OPTIONS), ]) -With this use of ``set_grid``, ``GridSearchCV(pipe)`` would not need the +With this use of ``set_search_grid``, ``GridSearchCV(pipe)`` would not need the parameter grid to be specified explicitly. Instead, a recursive descent through ``pipe``'s parameters allows it to reconstruct exactly the grid used in the example above. @@ -129,16 +129,19 @@ Implementation Four public methods will be added to ``BaseEstimator``:: - def set_grid(self, **grid: List[object]): + def set_search_grid(self, **grid: List[object]): """Sets candidate values for parameters in a search These candidates are used in grid search when a parameter grid is not explicitly specified. They are also used in randomized search in the - case where set_distribution has not been used for the corresponding + case where set_search_rvs has not been used for the corresponding parameter. 
+ Note that this parameter space has no effect when the estimator's own + ``fit`` method is called, but can be used by model selection utilities. + As with :meth:`set_params`, update semantics apply, such that - ``set_grid(param1=['a', 'b'], param2=[1, 2]).set_grid(param=['a'])`` + ``set_search_grid(param1=['a', 'b'], param2=[1, 2]).set_search_grid(param=['a'])`` will retain the candidates set for ``param2``. To reset the grid, each parameter's candidates should be set to ``[]``. @@ -171,7 +174,7 @@ Four public methods will be added to ``BaseEstimator``:: """ ... - def set_distribution(self, **distribution): + def set_search_rvs(self, **distribution): """Sets candidate values for parameters in a search These candidates are used in randomized search when a parameter @@ -180,7 +183,7 @@ Four public methods will be added to ``BaseEstimator``:: will also be used. As with :meth:`set_params`, update semantics apply, such that - ``set_distribution(param1=['a', 'b'], param2=[1, 2]).set_grid(param=['a'])`` + ``set_search_rvs(param1=['a', 'b'], param2=[1, 2]).set_search_grid(param=['a'])`` will retain the candidates set for ``param2``. To reset the grid, each parameter's candidates should be set to ``[]``. @@ -208,9 +211,9 @@ Four public methods will be added to ``BaseEstimator``:: including nested estimators can be constructed in combination with `get_params`. - For parameters where ``set_distribution`` has not been used, but ``set_grid`` + For parameters where ``set_search_rvs`` has not been used, but ``set_search_grid`` has been, this will return the corresponding list of values specified in - ``set_grid``. + ``set_search_grid``. Returns ------- @@ -232,7 +235,7 @@ can overwrite only one parameter's space without redefining everything. To facilitate this (in the absence of a polymorphic implementation of clone), we might need to store the candidate grids and distributions in a known instance attribute, or use a combination of `get_grid`, `get_distribution`, `get_params` -and `set_grid`, `set_distribution` etc. to perform `clone`. +and `set_search_grid`, `set_search_rvs` etc. to perform `clone`. Search estimators in `sklearn.model_selection` will be updated such that the currently required `param_grid` and `param_distributions` parameters will now default @@ -250,9 +253,11 @@ See the implementation of ``build_param_grid`` in Searchgrid [1]_, which applies to the grid search case. This algorithm enables the specification of searches over components in a pipeline as well as their parameters. -Where the search estimator perfoming the 'extract' algorithm extracts an empty +If the search estimator perfoming the 'extract' algorithm extracts an empty grid or distribution altogether for the given estimator, it should raise a -``ValueError``, indicative of likely user error. +`ValueError`, indicative of likely user error. Note that this allows a step in a +`Pipeline` to have an empty search space as long as at least one step of that +`Pipeline` defines a non-empty search space. Backward compatibility ---------------------- @@ -266,7 +271,7 @@ The fundamental change here is to associate parameter search configuration with Alternative APIs to do so include: -* Provide a function ``set_grid`` as Searchgrid [1]_ does, which takes an +* Provide a function ``set_search_grid`` as Searchgrid [1]_ does, which takes an estimator instance and a parameter space, and sets a private attribute on the estimator object. This avoids cluttering the estimator's method namespace. 
@@ -288,22 +293,16 @@ Alternative APIs to do so include: grids. This has similar benefits and downsides as `GridFactory`, while avoiding introducing a new API and instead relying on plain old Python dicts. Having multiple distinct dict-based representations of parameter spaces is - likely to confuse users. -Another questionable design is the separation of ``set_grid`` and ``set_distribution``. -These could be combined into a single method (``set_search_space``?), such that +Another questionable design is the separation of ``set_search_grid`` and ``set_search_rvs``. +These could be combined into a single method, such that :class:`~sklearn.model_selection.GridSearchCV` rejects a call to `fit` where `rvs` appear. This would make it harder to predefine search spaces that could be used for either exhaustive or randomised searches, which may be a use case in Auto-ML. -The design of using a keyword argument for each parameter in ``set_grid`` -encourages succinctness but reduces extensibility. -For example, we could design the API to require a single call per parameter:: - - est.set_grid('alpha', [.1, 1, 10]).set_grid('l1_ratio', [.2, .4, .6., .8]) - -This design would allow further parameters to be added to `set_grid` to enrich -the use of this data, including whether or not it is intended for randomised search. +Another possible consideration is whether `set_search_grid` should update rather than +replace the existing search space, to allow for incremental construction. This is likely +to confuse users more than help. Discussion ---------- From 0c946c7a703bfdf849fae77239e6b555dfb4f4cc Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Fri, 29 Dec 2023 12:29:32 +1100 Subject: [PATCH 12/13] address reviews --- slep016/proposal.rst | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/slep016/proposal.rst b/slep016/proposal.rst index c503728..5df67e2 100644 --- a/slep016/proposal.rst +++ b/slep016/proposal.rst @@ -228,7 +228,9 @@ The current distribution and grid values will be stored in a private attribute on the estimator, and ``get_grid`` may simply return this value, or an empty dict if undefined, while ``get_distribution`` will combine the stored parameter distributions with ``get_grid`` values. -The attribute will be undefined by default upon construction of the estimator. +The attribute will be undefined by default upon construction of the estimator, +though in the future we could consider default grids being specified for +some estimator classes. Parameter spaces should be copied in :ojb:`sklearn.base.clone`, so that a user can overwrite only one parameter's space without redefining everything. @@ -262,7 +264,11 @@ grid or distribution altogether for the given estimator, it should raise a Backward compatibility ---------------------- -No concerns +Where the user specifies an explicit grid, but one is also stored on the estimator +using `set_search_grid`, we will adopt legacy behaviour, and search with the +explicitly provided grid, maintaining backwards compatibility, and allowing a +manual override of the new behaviour. This behavior will be made clear in the +docuemntation of parameters like `param_grid` and `param_distributions`. Alternatives ------------ @@ -300,9 +306,9 @@ These could be combined into a single method, such that appear. This would make it harder to predefine search spaces that could be used for either exhaustive or randomised searches, which may be a use case in Auto-ML. 
-Another possible consideration is whether `set_search_grid` should update rather than -replace the existing search space, to allow for incremental construction. This is likely -to confuse users more than help. +Another possible alternative is to have `set_search_grid` update rather than +replace the existing search space, to allow for incremental construction. This is +likely to confuse users more than help. Discussion ---------- From bb56429a55298aa2bdca40dff7b40d9294acd00e Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 21 Mar 2024 09:32:14 +1100 Subject: [PATCH 13/13] Add clarifcation --- slep016/proposal.rst | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/slep016/proposal.rst b/slep016/proposal.rst index 5df67e2..dd23fad 100644 --- a/slep016/proposal.rst +++ b/slep016/proposal.rst @@ -305,10 +305,9 @@ These could be combined into a single method, such that :class:`~sklearn.model_selection.GridSearchCV` rejects a call to `fit` where `rvs` appear. This would make it harder to predefine search spaces that could be used for either exhaustive or randomised searches, which may be a use case in Auto-ML. - -Another possible alternative is to have `set_search_grid` update rather than -replace the existing search space, to allow for incremental construction. This is -likely to confuse users more than help. +That is, if we had a single set_search_space, an AutoML library that is set up to call it +would have to choose between setting RVs and grids. But this presumes a lot about how an +AutoML library using this API might look. Discussion ----------