diff --git a/index.rst b/index.rst
index ff7d43c..77ba421 100644
--- a/index.rst
+++ b/index.rst
@@ -25,6 +25,7 @@
    slep012/proposal
    slep017/proposal
    slep019/proposal
+   slep022/proposal

 .. toctree::
    :maxdepth: 1

diff --git a/slep022/proposal.rst b/slep022/proposal.rst
new file mode 100644
index 0000000..ef0ce28
--- /dev/null
+++ b/slep022/proposal.rst
@@ -0,0 +1,703 @@
.. _slep_022:

======================================
SLEP022: Fixing randomness ambiguities
======================================

:Author: Needs a champion
:Status: Under review
:Type: Standards Track
:Created: 2020-09-01

.. warning::

    Note: this SLEP is a complete re-write of the abandoned `SLEP 11 `_ and
    proposes a **different** solution.

    Play around with `this notebook `_ which illustrates the proposed
    solution.

    Please, **please** take some time to read and understand this SLEP
    before taking part in discussions.


Abstract
========

This SLEP aims at fixing various issues related to how scikit-learn handles
randomness in estimators and Cross-Validation (CV) splitters. The
`random_state` design makes estimators and splitters stateful across calls to
`fit` and `split`, which forces a complex mental model onto developers and
users. The main issues are:

- Implicit behaviour of CV procedures and meta-estimators: the estimator's
  `random_state` parameter implicitly (but significantly) influences the
  behaviour of seemingly unrelated procedures like `cross_val_score`, or
  meta-estimators like `RFE` and `SequentialFeatureSelection`. To understand
  these subtleties, users need to know inner details of these CV procedures
  and meta-estimators: namely, they need to know *when* `fit` and `clone`
  are called.
- Similarly, users are required to know how many times `split` is called in
  a CV procedure (e.g. `GridSearchCV.fit`), or they risk making unsound
  performance comparisons.
- Hard-to-find bugs, due to the insidious ways in which `random_state`
  affects estimators and splitters.

These issues are described in more detail in a dedicated section below. This
SLEP is *not* about:

- what the default value of `random_state` should be
- whether we should allow RandomState instances and None to be passed.

**The proposed solution is to**:

- Make estimators and splitters stateless, by drawing a random seed in
  `__init__`. This seed is used as the sole source of RNG in `fit` and in
  `split`. Neither `fit` nor `split` needs to be changed. This will allow
  users and developers to reason about `random_state` with a simpler mental
  model.
- Let users explicitly choose the strategy executed by CV procedures and
  meta-estimators, by introducing new parameters in those procedures, where
  relevant. A candidate parameter name is `use_exact_clones` (name open for
  discussion).

  This will **explicitly** expose the choice of the CV and meta-estimator
  strategy to users, instead of **implicitly** relying on the type of the
  estimator's `random_state`. Moreover, unlike in the legacy design, all
  strategies will be supported by any estimator, no matter the type of its
  `random_state` attribute.

.. note::
    The *legacy* design refers to the design as used in the current
    scikit-learn master branch, i.e. `0.24.dev`.

Issues with the legacy design
=============================

TL;DR: nobody, `not even core devs `_, understands how randomness works in
scikit-learn. We think we do, but we don't.
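As a minimal illustration of this statefulness (legacy behaviour; `X, y`
stand for any dataset)::

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold

    rng = np.random.RandomState(0)

    est = RandomForestClassifier(random_state=rng)
    est.fit(X, y)  # consumes numbers from rng
    est.fit(X, y)  # rng has advanced: this may build a different forest

    cv = KFold(shuffle=True, random_state=rng)
    splits_1 = list(cv.split(X))  # one set of splits...
    splits_2 = list(cv.split(X))  # ...and a different one on the second call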
There are two key problems that this SLEP solves:

- Estimators are stateful across calls to `fit()`.
- Users can't properly control what kind of CV procedure their code is
  running.

More details below in :ref:`issues`.


.. _key_use_cases:

Key use-cases and requirements
==============================

There are a few use-cases that are made possible by the legacy design and
that we will want to keep supporting in the new design. We will refer to
these use-cases in the rest of the document:

- A. Allow for consistent results across executions. This is a natural
  requirement for any implementation: we want users to be able to get
  consistently reproducible results across multiple program executions. This
  is currently supported by avoiding any use of `random_state=None`.
- B. Allow for maximum variability and different results across executions.
  This is currently supported by passing `random_state=None`: multiple
  program executions yield different results each time.
- C. CV procedures where the estimator's RNG is exactly the same on each
  fold. This is useful when one wants to make sure that a given seed will
  work well on their data. This is a legitimate requirement as long as the
  user accepts that the estimator may perform well just by chance. This is
  currently supported by passing an int **or a `RandomState` instance** to
  the `random_state` parameter of the estimator.
- D. CV procedures where the estimator's RNG is different for each fold. This
  is useful to increase the statistical significance of CV results. This is
  currently **only** supported by passing None as the `random_state`
  parameter of the estimator - and that throws reproducibility out the
  window. **In the legacy design, A and D are incompatible.**
- E. Obtain *strict* clones, i.e. clones that will yield exactly the same
  results when called on the same data. This is currently supported by
  calling `clone` if the estimator's `random_state` is an int or a
  `RandomState` instance.
- F. Obtain *statistical* clones, i.e. estimators that are identical but that
  will yield different results, even when called on the same data. This is
  currently supported by calling `clone` if the estimator's `random_state` is
  None. **F and A are incompatible.**
- G. Easily obtain different splits with the same characteristics from a
  splitter. By "same characteristics", we mean same number of splits, same
  choice of stratification, etc. This is useful to implement e.g.
  bootstrapping. This is currently supported by calling `split` multiple
  times on the same `cv` instance, if `cv.random_state` is an instance, or
  None. This can also be achieved by just creating multiple splitter
  instances with different `random_state` values (which could be integers).

.. note::
    C and E are closely related: in `cross_val_score`, C is supported by
    creating strict clones (E). Similarly, D is supported by creating
    statistical clones (F).

    In the legacy design, C and D are mutually exclusive: a given estimator
    can only do C and not D, or only D and not C. Also, a given estimator can
    only produce strict clones or only statistical clones, but not both. In
    the proposed design, all estimators will support both C and D. Similarly,
    strict and statistical clones can be obtained from any estimator.
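In the legacy design, supporting one use-case or another comes down to the
type that is passed as `random_state`. As a rough summary (using
`RandomForestClassifier` as an arbitrary randomized estimator)::

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # C + E: same estimator RNG on each fold, strict clones,
    # reproducible across executions (A)
    est = RandomForestClassifier(random_state=0)

    # also C + E and reproducible across executions, but stateful
    # across calls to fit
    est = RandomForestClassifier(random_state=np.random.RandomState(0))

    # D + F: different estimator RNG on each fold, statistical clones,
    # but reproducibility across executions (A) is lost
    est = RandomForestClassifier(random_state=None)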
Proposed Solution
=================

.. note::
    This proposed solution is a work in progress and there is room for
    improvement. Feel free to suggest any.

We want to make estimators and splitters stateless, while also avoiding the
ambiguity of CV procedures and meta-estimators. We also want to keep
supporting all the aforementioned use-cases in some way.

.. note::
    A toy implementation of the proposed solution with illustration snippets
    is available in `this notebook `_.

The proposed solution is to sample a seed from `random_state` at `__init__`
in estimators and splitters::

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.utils import check_random_state

    def _sample_seed(random_state):
        # sample a random seed to be stored as the random_state attribute
        # ints are passed-through
        if isinstance(random_state, int):
            return random_state
        else:
            # the dtype avoids overflow on platforms where the default
            # integer type is 32 bits
            return int(check_random_state(random_state).randint(
                0, 2**32, dtype=np.uint32))

    class MyEstimator(BaseEstimator):
        def __init__(self, random_state=None):
            self.random_state = _sample_seed(random_state)

        def set_params(self, random_state=None):
            # sketch: a real implementation would handle all parameters
            self.random_state = _sample_seed(random_state)

    class MySplitter(BaseSplitter):  # BaseSplitter = the splitter base class
        def __init__(self, random_state=None):
            self.random_state = _sample_seed(random_state)

`fit`, `split`, and `get_params` are unchanged.

In order to **explicitly** support use-cases C and D, CV procedures like
`cross_val_score` should be updated with a new `use_exact_clones`
parameter::

    def _check_statistical_clone_possible(est):
        if 'random_state' not in est.get_params():
            raise ValueError("This estimator isn't random and can only "
                             "have exact clones")

    def cross_val_score(est, X, y, cv, use_exact_clones=True):
        # use_exact_clones:
        # - if True, the same estimator RNG is used on each fold (use-case C)
        # - if False, a different estimator RNG is used on each fold
        #   (use-case D)

        if use_exact_clones:
            statistical = False
        else:
            # need a local RNG so that clones have different random_state
            # attributes
            _check_statistical_clone_possible(est)
            statistical = np.random.RandomState(est.random_state)

        return [
            clone(est, statistical=statistical)
            .fit(X[train], y[train])
            .score(X[test], y[test])
            for train, test in cv.split(X, y)
        ]

.. note::
    The name `use_exact_clones` is just a placeholder for now, and the final
    name is up for discussion. A more descriptive name for `cross_val_score`
    could be e.g. `estimator_randomness={'constant', 'splitwise'}`. The name
    doesn't have to be the same throughout all CV procedures, and it should
    accurately describe the alternative strategies that are possible.

Meta-estimators should be updated in a similar fashion to make their two
alternative behaviours explicit.

As can be seen from the above snippet, the `clone` function needs to be
updated so that one can explicitly request exact or statistical clones
(use-cases E and F)::

    def clone(est, statistical=False):
        # Return a strict clone or a statistical clone.

        # statistical parameter can be:
        # - False: a strict clone is returned
        # - True: a statistical clone is returned. Its RNG is seeded from
        #   `est`
        # - None, int, or RandomState instance: a statistical clone is
        #   returned. Its RNG is seeded from `statistical`. This is useful
        #   to create multiple statistical clones that don't have the same
        #   RNG

        params = est.get_params()

        if statistical is not False:
            # A statistical clone is a clone with a different random_state
            # attribute
            _check_statistical_clone_possible(est)
            rng = params['random_state'] if statistical is True else statistical
            params['random_state'] = _sample_seed(check_random_state(rng))

        return est.__class__(**params)

Note how one can pass a RandomState instance as the `statistical` parameter,
in order to obtain a sequence of estimators that have different RNGs. This is
used in particular in the above `cross_val_score`.
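With these pieces in place, the proposed design would behave as follows
(reusing the `MyEstimator` and `clone` sketches above; `X, y` stand for any
dataset)::

    est = MyEstimator(random_state=None)
    est.random_state  # an int, drawn once and for all in __init__

    est.fit(X, y)
    est.fit(X, y)  # same model as the first call: fit is idempotent

    exact = clone(est)                   # strict clone (use-case E): same seed
    stat = clone(est, statistical=True)  # statistical clone (use-case F): a
                                         # new seed, deterministically derived
                                         # from est's seed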
Use-cases support
-----------------

Use-cases A and B are supported just like before.

Use-cases C, D, E, F are explicitly supported *and* can be achieved by any
estimator, no matter its `random_state`. The legacy design can only
(implicitly) support either C and E, or D and F.

Use-case G can be explicitly supported by creating multiple CV instances,
each with a different `random_state`::

    rng = np.random.RandomState(0)
    cvs = [KFold(shuffle=True, random_state=rng)
           for _ in range(n_bootstrap_iterations)]

CV instances are extremely cheap to create and to store. Alternatively, we
can introduce the notion of statistical clone for splitters, and let `clone`
support splitters as well. This is however more involved.

Advantages
----------

- Users do not need to know the internals of CV procedures or meta-estimators
  anymore. Any potential ambiguity is now made explicit and exposed to them
  by a new parameter, which will also make documentation easier.
  *Removing ambiguity is the main point of this SLEP*.

- The mental model is simpler: no matter what is passed as `random_state`
  (int, RandomState instances, or None), results are constant across calls to
  `fit` and `split`. The RandomState instance is mutated **once** (and only
  once) in `__init__`. Bugs will be less likely to sneak in.

- Estimators and splitters are stateless, and `fit` is now properly
  idempotent.

- Users now have explicit control over the CV procedures and meta-estimators,
  instead of implicitly relying on the estimator's `random_state`. These
  procedures are not ambiguous anymore.

- Since CV splitters always yield the same splits, the chances of performing
  erroneous score comparisons are limited.

- `fit`, `split`, and `get_params` are unchanged.

Drawbacks
---------

- We break our convention that `__init__` should only ever store attributes
  as they are passed in (for integers, this convention is still respected).
  Note however that the reason we have this convention is that we want the
  semantics of `__init__` and `set_params` to be the same. **This is still
  respected here.** So this isn't really an issue.

- CV procedures and meta-estimators must be updated. There is however no way
  around this, if we want to explicitly expose the different possible
  strategies.

Backward compatibility and possible roadmap for adoption
---------------------------------------------------------

.. warning::

    I wrote this a long time ago. BC considerations might be easier / looser
    if we choose to only support this proposal on the new numpy RNG objects.
    I haven't thought about it in depth.

There are two main changes proposed here:

1. Making estimators and splitters stateless
2. Introducing the `use_exact_clones` (or alternative names) parameter to CV
   procedures and meta-estimators.

A possible deprecation path for 1) would be to warn the user the second time
`fit` (or `split`) is called, when `random_state` is an instance. There is no
need for a warning if `random_state` is an integer since estimators and
splitters are already stateless in this case. The warning would suggest that
users explicitly create a statistical clone (using `clone`) instead of
calling `fit` twice on the same instance. For splitters, we would just
suggest the creation of a new instance.
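A sketch of what such a transition warning could look like (illustrative
only; the exact condition and message are open for discussion)::

    import warnings

    import numpy as np

    class MyEstimator(BaseEstimator):
        def fit(self, X, y):
            # only warn on the second call to fit, and only when the
            # estimator is stateful, i.e. when random_state is an instance
            if getattr(self, '_fit_called', False) and isinstance(
                    self.random_state, np.random.RandomState):
                warnings.warn(
                    "Calling fit a second time on this estimator will yield "
                    "different results; create a statistical clone with "
                    "clone(est, statistical=True) instead.", FutureWarning)
            self._fit_called = True
            # ... regular fitting code ...
            return self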
However, making estimators stateless without supporting item 2) would mean
dropping support for use-case D. So item 2) should be implemented before item
1).

Item 2) can be implemented in a backward-compatible fashion by introducing
the `use_exact_clones` parameter where relevant with a default 'warn', which
would tell users that they should be setting this parameter explicitly from
now on::

    def cross_val_score(..., use_exact_clones='warn'):
        if use_exact_clones == 'warn':
            warn("The use_exact_clones parameter was introduced in ... "
                 "you should explicitly set it to True or False...")
            ...  # proceed as before, in a backward compatible way
        elif use_exact_clones is True:
            ...  # build exact clones (probably by sampling a seed from
                 # the estimator's random_state)
        else:
            ...  # build statistical clones (probably by creating a local
                 # RandomState instance)

However, if 1) isn't implemented yet, this is not fully satisfactory: if
estimators are still stateful, this means for example that the seed drawn in
the `elif` block would be different across calls to `cross_val_score`, since
the estimator's `random_state` would have been mutated. This is not
desirable: once 1) is implemented and the estimators become stateless,
multiple calls to `cross_val_score` should result in consistent results (as
expected).

As a result, 1) must be implemented before 2). But since 2) must also be
implemented before 1)... we have a chicken-and-egg problem.

The path with the least amount of friction would likely be to implement both
1) and 2) at the same time, and accept the fact that `cross_val_score` and
similar procedures might temporarily give different results across calls, in
cases where the estimator is still stateful (i.e. if its `random_state` is an
instance). The amount of user code where results would change once the
estimators become stateless is likely to be quite small. This may be
acceptable with a reasonable amount of outreach.

An easier but more brutal adoption path is to just break backward
compatibility and release this in a new major version, e.g. 1.0 or 2.0.

Alternative solutions
=====================

Change the default of all random_state from `None` to a hardcoded int
---------------------------------------------------------------------

This doesn't solve much: it might limit pitfalls in user code, but it does
not address the statefulness issues, nor the ambiguity of CV procedures and
meta-estimators.

Stop supporting RandomState instances
-------------------------------------

We would not be able to support use-cases D, F, and G, except by passing
`None`, but then it would be impossible to get reproducible results across
executions (use-case A). Nor does this address the statefulness and ambiguity
issues previously mentioned.

Store a seed in fit/split instead of in init
--------------------------------------------

Instead of storing a seed in `__init__`, we could store it the first time
`fit()` is called.
For example::

    def fit(self):  # or split()
        if not hasattr(self, '_random_state'):
            # the seed is only drawn on the first call to fit
            self._random_state = check_random_state(self.random_state).randint(
                0, 2**32, dtype=np.uint32)
        rng = self._random_state
        # ...

The advantage is that we respect our convention with `__init__`.

However, `fit` idempotency isn't respected anymore: the first call to `fit`
clearly influences all the other ones. The mental model is also not as clear
as that of the proposed solution: users don't really know when that seed is
going to be drawn unless they know the internals of the procedures they are
using.

This also introduces a private attribute, so we would need more intrusive
changes to `set_params`, `get_params`, and `clone`.

Add a new parameter to estimators instead of adding a parameter to CV procedures and meta-estimators
----------------------------------------------------------------------------------------------------

Instead of updating each CV procedure and meta-estimator with the
`use_exact_clones` parameter, we could instead add this parameter to the
estimators that have a `random_state` attribute, and let `clone` detect what
kind of clone it needs to output depending on the estimator's corresponding
attribute.

However, the strategy used by CV procedures and estimators would still be
somewhat implicit, and making these strategies explicit is one of the main
goals of this SLEP. It is also easier to document and to expose the different
possible strategies of a CV procedure when there is a dedicated parameter in
that procedure.

.. _issues:

Appendix: Issues with the legacy design
=======================================


Introduction: a complex mental model
------------------------------------

The rules that dictate how `random_state` impacts the behaviour of estimators
and CV splitters are seemingly simple: use an int and get the same results
across calls; use None or a `RandomState` instance and get different results.
Things get however much more complicated when these calls to `fit` and
`split` are embedded within other procedures like CV procedures or
meta-estimators.

Users and core devs alike still seem to have a hard time figuring it out.
Quoting Andreas Müller from `#14042 `_:

    The current design of random_state is often hard to understand and
    confusing for users. I think we should rethink how random_state works.

As further empirical evidence, the fact that CV splitters may yield different
splits across calls recently came as a surprise to a long-time core dev.

At the very least, it should be noted that prior to `#9517 `_ (merged in
2018) and `#18363 `_ (not yet merged as of September 2020), most subtleties
related to `random_state` were largely under-documented, and it is likely
that a lot of users are actually oblivious to most pitfalls.

The next sections describe:

- Issues that illustrate how `random_state` implicitly affects CV procedures
  and meta-estimators in insidious ways.
- Examples of bugs that `random_state` has caused, which are another sign
  that the legacy design might be too complex, even for core devs.

Before reading these sections, please make sure to read `#18363 `_ which
provides more context, and illustrates some use-cases that will be used here.

.. _estimator_issues:
Estimators: the behaviour of CV procedures and meta-estimators is implicitly dictated by the base estimator's `random_state`
-----------------------------------------------------------------------------------------------------------------------------

As described in more detail in `#18363 `_, the following lines run two very
different cross-validation procedures::

    cross_val_score(RandomForestClassifier(random_state=0), X, y)
    cross_val_score(RandomForestClassifier(random_state=np.random.RandomState(0)), X, y)

In the first one, the estimator RNG is the same on each fold and the RF
selects the same subset of features across folds. In the second one, the
estimator RNG varies across folds and the random subset of selected features
will vary. In other words, the CV strategy that `cross_val_score` uses
implicitly depends on the type of the estimator's `random_state` parameter.
This is unexpected, since the behaviour of `cross_val_score` should
preferably be determined by parameters passed to `cross_val_score`.

The same is true for any procedure that performs cross-validation (manual CV,
`GridSearchCV`, etc.). Things are particularly ambiguous in `GridSearchCV`:
when None is used, a new RNG is used on each fold, **but also for each
candidate**::

    fold 0: use different RNGs across candidates
    fold 1: use different RNGs across candidates (different RNGs from fold 0)
    fold 2: use different RNGs across candidates (different RNGs from folds 0 and 1)
    etc...

Users might actually expect that the RNG would be different on each fold, but
still constant across candidates, i.e. something like::

    fold 0: use same RNG for all candidates
    fold 1: use same RNG for all candidates (different RNG from fold 0)
    fold 2: use same RNG for all candidates (different RNG from folds 0 and 1)
    etc...

.. note::
    This strategy is in fact not even supported right now: neither integers,
    RandomState instances, nor None can achieve this.

Unfortunately, there is no way for users to figure out what strategy is used
until they look at the code. It is not just a documentation problem. The core
problem here is that **the behaviour of the CV procedure is implicitly
dictated by the estimator's** `random_state`.

There are similar issues in meta-estimators, like `RFE` or
`SequentialFeatureSelection`: these are iterative feature selection
algorithms that will use either *exact* or *statistical* clones at each
iteration, depending on the estimator's `random_state`. Exact and statistical
clones lead to two significantly different strategies. Here again, the
behaviour of these meta-estimators **is only implicitly dictated by the
estimator's** `random_state`.

In addition, since the sub-estimator's `random_state` type dictates the
strategy, users are bound to one single strategy once the estimator has been
created: it is for example impossible for an estimator to use a different RNG
across folds if that estimator was initialized with an integer.

It is unlikely that users have a perfect understanding of these subtleties.
For users to actually understand how `random_state` impacts the CV procedures
and meta-estimators, they need to know inner details of these procedures:
where and when `fit` is called, and when `clone` is called.

There is a very similar problem with CV splitters, as described in the next
section.

.. _cv_splitters_issues:
CV Splitters: users need to know inner details of CV procedures to avoid erroneous performance comparisons
------------------------------------------------------------------------------------------------------------

CV splitters yield different splits every time `split` is called if their
`random_state` is a RandomState instance. This means that the following code
doesn't allow fold-to-fold comparison of scores::

    rng = np.random.RandomState(0)
    cv = KFold(shuffle=True, random_state=rng)
    estimators = [...]  # the estimators you want to compare
    scores = {
        est: cross_val_score(est, X, y, cv=cv)
        for est in estimators
    }

Users might not realize it, but **the estimators will be evaluated on
different splits**, even though they think they've set the random state by
passing a carefully crafted instance. This is because `cv.split` was called
multiple times, yet these calls were hidden inside of `cross_val_score`. On
top of making fold-to-fold comparison impossible, comparing the average
scores is also not ideal if the number of folds or samples is small.

As a consequence, before users can safely report score comparisons, **they
need to know how many times** `split` **is called**, which should just be an
implementation detail. While the above example is already error-prone, things
get harder in more complex tools like `GridSearchCV`: how are users supposed
to know that `split` is called only once in `GridSearchCV.fit`?

.. note::
    This implementation detail about `GridSearchCV.fit` is in fact
    documented, but only at the very end of the cross-validation `User Guide
    `_. It is not documented where it should be, that is, in the
    hyper-parameter tuning User Guide or in the docstrings.

.. note::
    In `#18363 `_, we recommend that users use integers for CV splitters'
    `random_state`, effectively making them stateless.

.. note::
    Fixing how `random_state` is handled in the splitters is one of the
    entries in the `Roadmap `_.

.. _bugs:

Bugs
----

Many bugs have happened over the years because of RandomState instances and
None. Quoting Andreas Müller from `#14042 `_:

    There have been countless bugs because of this

("*This*" = RandomState instances and the implied statefulness of the
estimators).

Bugs caused by estimator statefulness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These bugs are often hard to find, and some of them are actual data leaks,
e.g. `#14034 `_.

They arise because the estimators are stateful across calls to `fit`. Fixing
them usually involves forcing the estimator to be (at least partially)
stateless. A classical bug that happened multiple times is that the
validation set may differ across calls to `fit` in a warm-start + early
stopping context. For example, `this fix `_ is to draw a random seed once
and to re-use that seed for data splitting when early stopping + warm start
is used. It is *not* an obvious bug, nor an obvious fix.

Making estimators stateless across calls to `fit` would prevent such bugs
from happening, and would keep the code-base cleaner.

Bugs caused by splitter statefulness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`#18431 `_ is a bug introduced in `SequentialFeatureSelection` that
perfectly illustrates the previous section :ref:`cv_splitters_issues`. The
bug was that splitter statefulness would lead to comparing average scores of
candidates that have been evaluated on different splits. Here again, the fix
is to enforce statelessness of the splitter: e.g.
`KFold(5, shuffle=True, random_state=None)` is forbidden.
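A minimal sketch of what enforcing splitter statelessness could look like
(hypothetical helper, not the actual scikit-learn code)::

    def _check_stateless_cv(cv):
        # a shuffling splitter only yields the same splits on every call to
        # split() if its seed is a fixed integer
        if getattr(cv, 'shuffle', False) and not isinstance(cv.random_state, int):
            raise ValueError(
                "When shuffle is True, random_state must be an int so that "
                "all candidates are evaluated on the same splits.")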
.. note::
    This bug was introduced by Nicolas Hug, who is this SLEP's author: it's
    very easy to let these bugs sneak in, even when you're trying hard not
    to.

Other potential bugs can happen in the parameter search estimators. When a
third-party library wants to implement its own parameter search strategy, it
needs to subclass `BaseSearchCV` and call a built-in function
`evaluate_candidates(candidates)` once or multiple times.
`evaluate_candidates` internally calls `split` once. If
`evaluate_candidates` is called more than once, this means that **the
candidate parameters are evaluated on different splits each time**.

This is quite a subtle issue that third-party developers might easily
overlook. Some core devs (Joel Nothman and Nicolas Hug) kept forgetting and
re-discovering this issue over and over in the `Successive Halving PR `_.

At the very least, this makes fold-to-fold comparison between candidates
impossible whenever the search strategy calls `evaluate_candidates` more
than once. It can however cause bigger bugs in other scenarios, e.g. if we
implement successive halving + warm start (details omitted here, you may
refer to `this issue `_).

.. note::
    In order to prevent any potential future bug and to prevent users from
    making erroneous comparisons between candidate scores, the
    `Successive Halving implementation `_ forbids users from using stateful
    splitters, just like `SequentialFeatureSelection` (see the note in the
    docstring for the `cv` parameter).

Other issues
------------

Fit idempotency isn't respected
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Quoting our `Developer Guidelines `_:

    When fit is called, any previous call to fit should be ignored.

This means that calling `est.fit(X, y)` twice should ideally yield the same
model. We have a check for that in the `check_estimator()` suite:
`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated
when None or a RandomState instance is used.

The behaviour of `clone` is implicit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Much like CV procedures and meta-estimators, what `clone` returns implicitly
depends on the estimator's `random_state`: it may return an exact clone or a
statistical clone.

.. note::
    While less user-facing, this issue is actually part of the root cause of
    the aforementioned issues related to estimators in CV procedures and in
    meta-estimators.

.. References and Footnotes
.. ------------------------

.. .. [1] Each SLEP must either be explicitly labeled as placed in the public
..    domain (see this SLEP as an example) or licensed under the `Open
..    Publication License`_.

.. .. _Open Publication License: https://www.opencontent.org/openpub/


.. Copyright
.. ---------

.. This document has been placed in the public domain. [1]_