"""
Hyperparameter Search for Linear Methods
=============================================================
This guide helps users tune the hyperparameters of the feature generation step and the linear model.

Here we show an example of tuning a linear text classifier with the `rcv1 dataset <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#rcv1v2%20(topics;%20full%20sets)>`_.
We start by loading and preprocessing the data without using ``Preprocessor``:
"""

from sklearn.preprocessing import MultiLabelBinarizer

import libmultilabel.linear as linear

# Load the rcv1 data prepared in the Data Preparation step of the command-line
# tutorial; the paths below assume that layout.
datasets = linear.load_dataset("txt", "data/rcv1/train.txt", "data/rcv1/test.txt")
binarizer = MultiLabelBinarizer(sparse_output=True)
y = binarizer.fit_transform(datasets["train"]["y"]).astype("d")

######################################################################
# We format labels into a 0/1 sparse matrix with ``MultiLabelBinarizer``.
#
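# As a quick illustration (a toy example we add here; the label strings are
# made up and not part of the rcv1 data), ``MultiLabelBinarizer`` maps label
# sets to 0/1 indicator rows:

toy_binarizer = MultiLabelBinarizer(sparse_output=True)
toy_y = toy_binarizer.fit_transform([["sports"], ["finance", "politics"]])
print(toy_binarizer.classes_)  # ['finance' 'politics' 'sports']
print(toy_y.toarray())  # [[0 0 1] [1 1 0]]

######################################################################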
# Next, we construct a ``Pipeline`` object that will be used for hyperparameter search later.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# The initial vectorizer and estimator settings below follow the quickstart
# tutorial and are assumptions; the grid search later overrides the searched
# parameters anyway.
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(max_features=20000, min_df=3)),
        ("clf", linear.MultiLabelEstimator(options="-s 2 -m 4", linear_technique="1vsrest", scoring_metric="P@1")),
    ]
)

######################################################################
# The vectorizer ``TfidfVectorizer`` is used in the ``Pipeline`` to generate TF-IDF features from raw texts.
# As for the estimator ``MultiLabelEstimator``, the argument ``options`` is a LIBLINEAR option
# (see *train Usage* in the `liblinear <https://github.com/cjlin1/liblinear>`__ README), and
# ``linear_technique`` is one of the linear techniques: ``1vsrest``, ``thresholding``, ``cost_sensitive``,
# ``cost_sensitive_micro``, and ``binary_and_multiclass``.
#
# We can specify aliases for the components used by the pipeline.
# For example, ``tfidf`` is the alias of ``TfidfVectorizer`` and ``clf`` is the alias of the estimator.
#
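# As a small aside of our own (using sklearn's standard ``clone``/``set_params``
# API, nothing LibMultiLabel-specific), the ``alias__parameter`` syntax is how
# nested parameters are addressed:

from sklearn.base import clone

# Clone first so the tutorial's pipeline is left untouched.
demo_pipeline = clone(pipeline).set_params(tfidf__max_features=10000)
print(demo_pipeline.get_params()["tfidf__max_features"])  # prints 10000

######################################################################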
# To search for the best setting, we employ ``GridSearchCV``.
# The usage is similar to sklearn's except that the parameter ``scoring`` is not available. Please specify
# ``scoring_metric`` in ``linear.MultiLabelEstimator`` instead.

liblinear_options = ["-s 2 -c 0.5", "-s 2 -c 1", "-s 2 -c 2", "-s 1 -c 0.5", "-s 1 -c 1", "-s 1 -c 2"]
parameters = {"clf__options": liblinear_options, "tfidf__max_features": [10000, 20000, 40000], "tfidf__min_df": [3, 5]}
clf = linear.GridSearchCV(pipeline, parameters, cv=5, n_jobs=4, verbose=1)
clf = clf.fit(datasets["train"]["x"], y)

######################################################################
# Here we check the combinations of six feature generation options (three choices of
# ``max_features`` times two choices of ``min_df``) and six LIBLINEAR options in the
# linear classifier, i.e., 36 configurations in total, each evaluated with 5-fold
# cross-validation. The keys in ``parameters`` follow sklearn's convention, starting
# with the component's alias and two underscores (e.g., ``clf__``).
# We specify ``n_jobs=4`` to run four tasks in parallel.
# After finishing the grid search, we can get the best parameters with the following code:

for param_name in sorted(parameters.keys()):
    print(f"{param_name}: {clf.best_params_[param_name]}")
89 | 60 |
|
90 | 61 | ######################################################################
|
91 | 62 | # The best parameters are::
|
92 | 63 | #
|
93 |
| -# clf__options: '-s 2 -c 0.5 -m 1' |
94 |
| -# tfidf__max_features: 20000 |
95 |
| -# tfidf__min_df: 3 |
| 64 | +# clf__options: -s 2 -c 0.5 -m 1 |
| 65 | +# tfidf__max_features: 10000 |
| 66 | +# tfidf__min_df: 5 |
96 | 67 | #
|
97 |
| -# For testing, we also need to read in data first and format test labels into a 0/1 sparse matrix. |
98 |
| - |
99 |
| -y = binarizer.transform(datasets["test"]["y"]).astype("d").toarray() |
100 |
| - |
101 |
| -###################################################################### |
102 |
| -# Applying the ``predict`` function of ``GridSearchCV`` object to use the |
103 |
| -# estimator trained under the best hyper-parameters for prediction. |
| 68 | +# Note that in the above code, the ``refit`` argument of ``GridSearchCV`` is enabled by default, meaning that the best configuration will be trained on the whole dataset after hyperparameter search. |
| 69 | +# We refer to this as the retrain strategy. |
| 70 | +# After fitting ``GridSearchCV``, the retrained model is stored in ``clf``. |
| 71 | +# |
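# To look beyond the single best configuration, the sketch below tabulates the
# per-configuration scores. It assumes ``linear.GridSearchCV`` exposes sklearn's
# ``cv_results_`` attribute, as its sklearn counterpart does.

import pandas as pd

results = pd.DataFrame(clf.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]].sort_values("rank_test_score").head())

######################################################################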
# We can apply the ``predict`` function of the ``GridSearchCV`` object to make predictions
# with the estimator trained under the best hyperparameters.
# Then use ``linear.compute_metrics`` to calculate the test performance.

# For testing, we also need to format the test labels into a 0/1 matrix.
y = binarizer.transform(datasets["test"]["y"]).astype("d").toarray()
preds = clf.predict(datasets["test"]["x"])
metrics = linear.compute_metrics(
    preds,
    y,
    monitor_metrics=["Macro-F1", "Micro-F1", "P@1", "P@3", "P@5"],  # matches the metrics reported below
)
print(metrics)

######################################################################
# The result of the best parameters will look similar to::
#
#   {'Macro-F1': 0.5296621774388927, 'Micro-F1': 0.8021279986938116, 'P@1': 0.9561621216872636, 'P@3': 0.7983185389507189, 'P@5': 0.5570921518306848}
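
######################################################################
# As a final illustrative sketch of our own (not part of the original tutorial):
# because the fitted pipeline embeds the vectorizer, raw text can be fed directly
# to ``predict``. The documents below are made up, and we assume the predictions
# are real-valued decision scores, thresholded at zero as in LIBLINEAR.

new_texts = ["federal reserve raises interest rates", "oil prices fall on supply concerns"]
scores = clf.predict(new_texts)
print(binarizer.inverse_transform(scores > 0))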