
Commit aac43e7

Author: Thomas Bury (committed)
style: 💄 apply black
1 parent 416d639 commit aac43e7

File tree

5 files changed: +59 -53 lines changed


docs/Methods overview.rst

Lines changed: 2 additions & 4 deletions
@@ -62,7 +62,7 @@ BorutaPy vs. Boruta R:
 * Using either the native variable importance, scikit permutation importance, SHAP importance.
 
 We highly recommend using pruned trees with a depth between 3-7. For more, see the docs of these functions, and the examples below. Original code and method by: Miron B Kursa, https://m2.icm.edu.pl/boruta/
-
+
 GrootCV, a new method
 ---------------------
 
@@ -84,9 +84,7 @@ Re-implementing the Uber MRmr scheme using associations for handling continuous
 Lasso
 -----
 
-Performing a simple grid search
-
-with enforced lasso regularization.
+Performing a simple grid search with enforced lasso regularization.
 The best model is chosen based on the minimum BIC or deviance score, and all non-zero coefficients are selected.
 The loss function can belong to the exponential family, as seen in the statsmodels GLM documentation.
 Using the bic metric is faster since it is evaluated on the training data, making it unsuitable for the test data, whereas the deviance is cross-validated.
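
The reflowed paragraph describes the selection scheme: grid-search the lasso penalty, score each fit by BIC (or cross-validated deviance), and keep the non-zero coefficients. A minimal sketch of that scheme, assuming statsmodels' elastic-net fit_regularized with refit=True and a Gaussian family (not the arfs implementation, whose grid and family are not shown in this diff):

import numpy as np
import statsmodels.api as sm

def lasso_bic_selection(X, y, alphas=np.logspace(-3, 1, 20)):
    # Grid-search the lasso penalty; keep the fit with the lowest BIC.
    best_bic, best_res = np.inf, None
    for alpha in alphas:
        # L1_wt=1.0 -> pure lasso; refit=True refits the selected support
        # unpenalized, so standard statistics such as BIC are defined.
        res = sm.GLM(y, sm.add_constant(X)).fit_regularized(
            alpha=alpha, L1_wt=1.0, refit=True
        )
        if res.bic < best_bic:
            best_bic, best_res = res.bic, res
    # all non-zero coefficients are selected
    selected = np.flatnonzero(np.abs(best_res.params) > 0)
    return best_res, selected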

src/arfs/association.py

Lines changed: 29 additions & 25 deletions
@@ -1436,7 +1436,8 @@ def _callable_association_matrix_fn(
 
 
 def f_oneway_weighted(*args):
-    """Calculate the weighted F-statistic for one-way ANOVA (continuous target, categorical predictor).
+    """
+    Calculate the weighted F-statistic for one-way ANOVA (continuous target, categorical predictor).
 
     Parameters
     ----------
@@ -1455,6 +1456,7 @@ def f_oneway_weighted(*args):
     Notes
     -----
     The F-statistic is calculated as:
+
     .. math::
         F(rf) = \\frac{\\sum_i (\\bar{Y}_{i \\bullet} - \\bar{Y})^2 / (K-1)}{\\sum_i \\sum_k (\\bar{Y}_{ij} - \\bar{Y}_{i\\bullet})^2 / (N - K)}
     """
@@ -1667,13 +1669,13 @@ def f_cont_regression_parallel(
 def f_stat_regression_parallel(
     X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na="drop"
 ):
-    """f_stat_regression_parallel computes the weighted explained variance for the provided categorical
-    and numerical predictors using parallelization of the code.
+    """
+    Compute the weighted explained variance for the provided categorical and numerical predictors using parallelization.
 
     Parameters
     ----------
     X : array-like of shape (n_samples, n_features)
-        Predictor dataframe.
+        The predictor dataframe.
     y : array-like of shape (n_samples,)
         The target vector.
     sample_weight : array-like of shape (n_samples,), optional
@@ -1835,7 +1837,8 @@ def f_cat_classification_parallel(
     force_finite=True,
     handle_na="drop",
 ):
-    """Univariate information dependence.
+    """
+    Univariate information dependence.
 
     It ranks features in the same order if all the features are positively correlated with the target.
     Note that it is therefore recommended as a feature selection criterion to identify
@@ -1858,15 +1861,15 @@ def f_cat_classification_parallel(
         Whether or not to force the F-statistics and associated p-values to
         be finite. There are two cases where the F-statistic is expected to not
         be finite:
-        - when the target `y` or some features in `X` are constant. In this
-        case, the Pearson's R correlation is not defined leading to obtain
-        `np.nan` values in the F-statistic and p-value. When
-        `force_finite=True`, the F-statistic is set to `0.0` and the
-        associated p-value is set to `1.0`.
-        - when a feature in `X` is perfectly correlated (or
-        anti-correlated) with the target `y`. In this case, the F-statistic
-        is expected to be `np.inf`. When `force_finite=True`, the F-statistic
-        is set to `np.finfo(dtype).max`.
+        - when the target `y` or some features in `X` are constant. In this
+          case, the Pearson's R correlation is not defined leading to obtain
+          `np.nan` values in the F-statistic and p-value. When
+          `force_finite=True`, the F-statistic is set to `0.0` and the
+          associated p-value is set to `1.0`.
+        - when a feature in `X` is perfectly correlated (or
+          anti-correlated) with the target `y`. In this case, the F-statistic
+          is expected to be `np.inf`. When `force_finite=True`, the F-statistic
+          is set to `np.finfo(dtype).max`.
 
     Returns
     -------
@@ -1908,13 +1911,13 @@ def f_cat_classification_parallel(
 def f_stat_classification_parallel(
     X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na="drop"
 ):
-    """f_stat_classification_parallel computes the weighted ANOVA F-value for the provided categorical
-    and numerical predictors using parallelization of the code.
+    """
+    Compute the weighted ANOVA F-value for the provided categorical and numerical predictors using parallelization.
 
     Parameters
     ----------
     X : array-like of shape (n_samples, n_features)
-        Predictor dataframe.
+        The predictor dataframe.
     y : array-like of shape (n_samples,)
         The target vector.
     sample_weight : array-like of shape (n_samples,), optional
@@ -2110,26 +2113,27 @@ def xy_to_matrix(xy):
 
 
 def cluster_sq_matrix(sq_matrix, method="ward"):
-    """cluster_sq_matrix applies agglomerative clustering in order to sort
-    a correlation matrix.
+    """
+    Apply agglomerative clustering to sort a square correlation matrix.
 
     Parameters
     ----------
     sq_matrix : pd.DataFrame
-        a square correlation matrix
+        A square correlation matrix.
     method : str, optional
-        linkage method, by default "ward"
+        The linkage method, by default "ward".
 
     Returns
     -------
     pd.DataFrame
-        a sorted square matrix
+        A sorted square matrix.
+
+    Example
+    -------
+    >>> from some_module import association_matrix, cluster_sq_matrix
 
-    Example:
-    --------
     >>> assoc = association_matrix(iris_df, plot=False)
     >>> assoc_clustered = cluster_sq_matrix(assoc, method="complete")
-
     """
     d = sch.distance.pdist(sq_matrix.values)
     L = sch.linkage(d, method=method)
src/arfs/feature_selection/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -23,5 +23,5 @@
     "BoostAGroota",
     "GrootCV",
     "MinRedundancyMaxRelevance",
-    "LassoFeatureSelection"
+    "LassoFeatureSelection",
 ]

src/arfs/feature_selection/lasso.py

Lines changed: 10 additions & 10 deletions
@@ -89,17 +89,17 @@ def __init__(
         link:
             the GLM link function
         alpha :
-            The penalty weight. If a scalar, the same penalty weight applies to all variables in the model.
+            The penalty weight. If a scalar, the same penalty weight applies to all variables in the model.
             If a vector, it must have the same length as params, and contains a penalty weight for each coefficient.
         L1_wt :
-            The `L1_wt` parameter represents the weight of the L1 penalty term in the model and
-            should be within the range 0 to 1. A value of 0 corresponds to ridge regression,
-            while a value of 1 corresponds to lasso regression. However, for obtaining statistics,
-            `L1_wt` should be set to a value greater than 0. If it is set to 0.0, statsmodels returns
-            a ridge regularized wrapper without refitting the model, making the statistics unavailable
-            and breaking the class. Nevertheless, you can set `L1_wt` to a very small value, such as 1e-9,
+            The `L1_wt` parameter represents the weight of the L1 penalty term in the model and
+            should be within the range 0 to 1. A value of 0 corresponds to ridge regression,
+            while a value of 1 corresponds to lasso regression. However, for obtaining statistics,
+            `L1_wt` should be set to a value greater than 0. If it is set to 0.0, statsmodels returns
+            a ridge regularized wrapper without refitting the model, making the statistics unavailable
+            and breaking the class. Nevertheless, you can set `L1_wt` to a very small value, such as 1e-9,
             to obtain close-to-ridge behavior while still obtaining the necessary statistics.
-
+
         fit_intercept :
             Whether to fit an intercept term in the model.
         """
@@ -157,13 +157,13 @@ def fit(
         self : object
             Returns self.
         """
-
+
         # see the if kwargs.get("L1_wt", 1) == 0 condition in
         # https://www.statsmodels.org/dev/_modules/statsmodels/genmod/generalized_linear_model.html#GLM.fit_regularized
         # workaround to get the statistics
         if self.alpha == 0.0:
             self.alpha = 1e-9
-
+
         if not isinstance(X, pd.DataFrame):
             X = pd.DataFrame(X)
             X.columns = [f"pred_{i}" for i in range(X.shape[1])]
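
Usage would presumably look like the following, inferred from the fit signature above and the scikit-learn selector convention; the transform call and the default constructor arguments are assumptions:

import numpy as np
import pandas as pd
from arfs.feature_selection import LassoFeatureSelection

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 3 * X["x0"] - 2 * X["x3"] + rng.normal(size=200)

selector = LassoFeatureSelection()  # defaults per the __init__ docstring
selector.fit(X, y)                  # non-zero coefficients define the support
X_selected = selector.transform(X)  # assumed sklearn-style selector API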

src/arfs/preprocessing.py

Lines changed: 17 additions & 13 deletions
@@ -42,6 +42,7 @@
 # fix random seed for reproducibility
 np.random.seed(7)
 
+
 class OrdinalEncoderPandas(OrdinalEncoder):
     # class OrdinalEncoderPandas(BaseEstimator, TransformerMixin):
     """Encode categorical features as an integer array and returns a pandas DF.
@@ -391,10 +392,11 @@ def cat_var(data, col_excl=None, return_cat=True):
 
 
 class TreeDiscretizer(BaseEstimator, TransformerMixin):
-    """The purpose of the function is to discretize continuous and/or categorical data, returning a pandas DataFrame.
-    It is designed to support regression and binary classification tasks. Discretization, also known as quantization or binning,
-    allows for the partitioning of continuous features into discrete values. In certain datasets with continuous attributes,
-    discretization can be beneficial as it transforms the dataset into one with only nominal attributes.
+    """
+    Discretize continuous and/or categorical data using univariate regularized trees, returning a pandas DataFrame.
+    The TreeDiscretizer is designed to support regression and binary classification tasks.
+    Discretization, also known as quantization or binning, allows for the partitioning of continuous features into discrete values.
+    In certain datasets with continuous attributes, discretization can be beneficial as it transforms the dataset into one with only nominal attributes.
     Additionally, for categorical predictors, grouping levels can help reduce overfitting and create meaningful clusters.
 
     By encoding discretized features, a model can become more expressive while maintaining interpretability.
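
A minimal sketch of the idea in this docstring, fitting one shallow univariate tree per column and using its split thresholds as bin edges (the arfs implementation additionally handles categoricals, sample weights, and level grouping):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def tree_bin_column(x, y, max_leaf_nodes=8):
    # x: pd.Series holding one continuous feature, y: the target
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
    tree.fit(x.to_frame(), y)
    # split thresholds of the internal nodes become the bin edges
    # (leaves are marked with feature == -2 in sklearn's tree arrays)
    edges = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])
    return pd.cut(x, bins=np.concatenate(([-np.inf], edges, [np.inf])))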
@@ -502,22 +504,22 @@ def __init__(
         self.cat_features = None
 
     def fit(self, X, y, sample_weight=None):
-        """Fit the discretizer on `X`.
+        """
+        Fit the TreeDiscretizer on the input data.
 
         Parameters
        ----------
         X : array-like of shape (n_samples, n_features)
-            Input data with shape (n_samples, n_features), where `n_samples` is the number of samples and
-            `n_features` is the number of features.
+            The predictor dataframe.
         y : array-like of shape (n_samples,)
-            Target for internally fitting the tree(s).
+            The target vector.
         sample_weight : array-like of shape (n_samples,), optional
-            Sample weight (e.g., exposure) if any.
+            The weight vector, by default None.
 
         Returns
         -------
-        X : pd.DataFrame
-            DataFrame with the binned and grouped columns.
+        self : object
+            Returns self.
         """
         X = X.copy()
 
@@ -640,7 +642,8 @@ def fit(self, X, y, sample_weight=None):
         return self
 
     def transform(self, X):
-        """Apply the discretizer on `X`. Only the columns with more than n_bins_max unique values will be transformed.
+        """
+        Apply the discretizer on `X`. Only the columns with more than n_bins_max unique values will be transformed.
 
         Parameters
         ----------
@@ -690,7 +693,8 @@ def transform(self, X):
 
 
 def highlight_discarded(s):
-    """highlight X in red and V in green.
+    """
+    highlight X in red and V in green.
 
     Parameters
     ----------
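
Only the docstring head appears in the hunk; a plausible body (an assumption, not the verbatim source) returns one CSS string per cell for use with the pandas Styler API:

def highlight_discarded(s):
    """Highlight X in red and V in green (sketch of the assumed body)."""
    return [
        "color: red" if v == "X" else "color: green" if v == "V" else ""
        for v in s
    ]

# typical column-wise use with the pandas Styler API:
# summary_df.style.apply(highlight_discarded)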
