Commit eb711ba

Merge pull request #381 from Gordon119/pipeline_tutorial
Multi-Label Tutorial
2 parents 2982ed3 + 6b84c38

15 files changed: +364 -101 lines changed

docs/data_preparation.rst

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+Data Preparation
+================
+
+.. toctree::
+    :maxdepth: 1
+    :titlesonly:
+
+
+    ../auto_examples/plot_dataset_tutorial
+    ../auto_examples/plot_linear_feature_gen

docs/examples/plot_dataset_tutorial.py

Lines changed: 6 additions & 6 deletions
@@ -1,8 +1,8 @@
 """
-An Example of Using Data Stored in Different Forms
+Using Data not in Default Forms
 ===================================================

-Different data sets are stored in various structures and formats.
+Different datasets are stored in various structures and formats.
 To apply LibMultiLabel with any of them, one must convert the data to a form accepted by the library first.
 In this tutorial, we demonstrate an example of converting a hugging face data set.
 Before we start, note that LibMultiLabel format consists of IDs (optional), labels, and raw texts.
@@ -21,8 +21,8 @@
 from datasets import load_dataset

 ######################################################################
-# We choose a multi-label set ``emoji`` from ``tweet_eval`` in this example.
-# The data set can be loaded by the following code.
+# We choose a multi-label dataset ``emoji`` from ``tweet_eval`` in this example.
+# The dataset can be loaded by the following code.

 hf_datasets = dict()
 hf_datasets["train"] = load_dataset("tweet_eval", "emoji", split="train")
@@ -60,9 +60,9 @@
 datasets = preprocessor.fit_transform(datasets)

 ###############################################################################
-# Also, if you want to use a NN model,
+# In this case, if you want to use a deep learning model,
 # use ``load_datasets`` from ``libmultilabel.nn.data_utils`` and change the data to the dataframes we created.
-# Here is the modification of our `Bert model quickstart <https://www.csie.ntu.edu.tw/~cjlin/libmultilabel/auto_examples/plot_BERT_quickstart.html>`_.
+# Here is the modification of our `Bert model quickstart <../auto_examples/plot_bert_quickstart.html>`_.

 from libmultilabel.nn.data_utils import load_datasets

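The hunks above only show the loading step; a minimal sketch of the full conversion this tutorial describes, turning each Hugging Face split into the label/text dataframe that LibMultiLabel's ``Preprocessor`` accepts, could look as follows. The ``validation`` split name and the list-wrapping of labels are assumptions for illustration, not part of the diff.

    import pandas as pd
    from datasets import load_dataset

    # Load every split of the emoji task, not just train.
    hf_datasets = dict()
    for split in ["train", "validation", "test"]:
        hf_datasets[split] = load_dataset("tweet_eval", "emoji", split=split)

    # LibMultiLabel's format is IDs (optional), labels, and raw texts, so keep
    # only a label column and a text column per split.
    datasets = dict()
    for split, hf_split in hf_datasets.items():
        df = hf_split.to_pandas()
        datasets[split] = pd.DataFrame({
            # Wrap each label in a list as a multi-label stand-in.
            "label": df["label"].map(lambda y: [y]),
            "text": df["text"],
        })
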
docs/examples/plot_linear_feature_gen.py

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+"""
+Tweaking Feature Generation for Linear Methods
+=============================================================
+
+In both `API <../auto_examples/plot_linear_quickstart.html>`_ and `CLI <../cli/linear.html>`_ usage of linear methods, LibMultiLabel handles the feature generation step by default.
+Unless necessary, you do not need to generate features in different ways as described in this tutorial.
+
+This tutorial demonstrates how to customize the way to generate features for linear methods through an API example.
+Here we use the `rcv1 <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#rcv1v2%20(topics;%20full%20sets)>`_ dataset as an example.
+"""
+
+from sklearn.preprocessing import MultiLabelBinarizer
+from libmultilabel import linear
+
+datasets = linear.load_dataset("txt", "data/rcv1/train.txt", "data/rcv1/test.txt")
+tfidf_params = {
+    "max_features": 20000,
+    "min_df": 3,
+    "ngram_range": (1, 3)
+}
+preprocessor = linear.Preprocessor(tfidf_params=tfidf_params)
+preprocessor.fit(datasets)
+datasets = preprocessor.transform(datasets)
+
+############################################
+# The argument ``tfidf_params`` of the ``Preprocessor`` can specify how to generate the TF-IDF features.
+# In this example, we adjust the ``max_features``, ``min_df``, and ``ngram_range`` of the preprocessor.
+# For an explanation of these three and other options, refer to the `sklearn page <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>`_.
+# Users can also try other methods to generate features, like word embeddings.
+#
+# Finally, we use the generated numerical features to train and evaluate the model.
+# The rest of the steps is the same as in the quickstarts.
+# Please refer to them for details.
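Since the closing comment leaves "other methods to generate features" open, here is a sketch of bypassing ``Preprocessor`` entirely: build the features with scikit-learn and hand the matrices straight to a linear method. It mirrors code removed from the next file; the liblinear options are illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from libmultilabel import linear

    datasets = linear.load_dataset("txt", "data/rcv1/train.txt", "data/rcv1/test.txt")

    # Build the 0/1 sparse label matrix and the TF-IDF features by hand.
    binarizer = MultiLabelBinarizer(sparse_output=True)
    y = binarizer.fit_transform(datasets["train"]["y"]).astype("d")
    vectorizer = TfidfVectorizer(max_features=20000, min_df=3, ngram_range=(1, 3))
    x = vectorizer.fit_transform(datasets["train"]["x"])

    # Any feature matrix with one row per document works here,
    # e.g. averaged word embeddings instead of TF-IDF.
    model = linear.train_1vsrest(y, x, "-s 2 -m 4")
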
Lines changed: 28 additions & 57 deletions
@@ -1,12 +1,10 @@
 """
-Feature Generation and Parameter Selection for Linear Methods
+Hyperparameter Search for Linear Methods
 =============================================================
+This guide helps users to tune the hyperparameters of the feature generation step and the linear model.

-This tutorial demonstrates feature generation and parameter selection for linear methods.
-
-Here we show an example of training a linear text classifier with the rcv1 dataset.
-If you haven't downloaded it yet, see `Data Preparation <../cli/linear.html#step-1-data-preparation>`_.
-Then you can read and preprocess the data as follows
+Here we show an example of tuning a linear text classifier with the `rcv1 dataset <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#rcv1v2%20(topics;%20full%20sets)>`_.
+Starting with loading and preprocessing of the data without using ``Preprocessor``:
 """

 from sklearn.preprocessing import MultiLabelBinarizer
@@ -17,33 +15,9 @@
 y = binarizer.fit_transform(datasets["train"]["y"]).astype("d")

 ######################################################################
-# We format labels into a 0/1 sparse matrix with ``MultiLabelBinarizer``.
-#
-# Feature Generation
-# ------------------
-# Before training a linear classifier, we must convert each text to a vector of numerical features.
-# To use the default setting (TF-IDF features), check
-# `Linear Model for MultiLabel Classification <../auto_examples/plot_linear_quickstart.html#linear-model-for-multi-label-classification>`_
-# for easily conducting training and testing.
-#
-# If you want to tweak the generation of TF-IDF features, consider
-
-from sklearn.feature_extraction.text import TfidfVectorizer
-
-vectorizer = TfidfVectorizer(max_features=20000, min_df=3)
-x = vectorizer.fit_transform(datasets["train"]["x"])
-model = linear.train_1vsrest(y, x, "-s 2 -m 4")
-
-#######################################################################
-# We use the generated numerical features ``x`` as the input of
-# the linear method ``linear.train_1vsrest``.
+# we format labels into a 0/1 sparse matrix with ``MultiLabelBinarizer``.
 #
-# An Alternative Way for Using a Linear Method
-# --------------------------------------------
-# Besides the default way shown in `Feature Generation <#feature-generation>`_,
-# we can construct a sklearn estimator for training and prediction.
-# This way is used namely for parameter selection described later,
-# as the estimator makes LibMultiLabel methods in a sklearn Pipeline for a grid search.
+# Next, we construct a ``Pipeline`` object that will be used for hyperparameter search later.

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.pipeline import Pipeline
@@ -56,53 +30,50 @@
 )

 ######################################################################
-# For the estimator ``MultiLabelEstimator``, arguments ``options`` is a LIBLINEAR option
+# The vectorizer ``TfidfVectorizer`` is used in ``Pipeline`` to generate TF-IDF features from raw texts.
+# As for the estimator ``MultiLabelEstimator``, argument ``options`` is a LIBLINEAR option
 # (see *train Usage* in `liblinear <https://github.com/cjlin1/liblinear>`__ README), and
-# ``linear_technique`` is one of linear techniques: ``1vsrest``, ``thresholding``, ``cost_sensitive``,
-# ``cost_sensitive_micro``, and ``binary_and_mulitclass``.
-# In ``pipeline``, we specify settings used by the estimator.
+# ``linear_technique`` is one of the linear techniques, including ``1vsrest``, ``thresholding``, ``cost_sensitive``,
+# ``cost_sensitive_micro``, and ``binary_and_multiclass``.
+#
+# We can specify the aliases of the components used by the pipeline.
 # For example, ``tfidf`` is the alias of ``TfidfVectorizer`` and ``clf`` is the alias of the estimator.
 #
-# We can then use the following code for training.
-pipeline.fit(datasets["train"]["x"], y)
-
-######################################################################
-# Grid Search over Feature Generations and LIBLINEAR Options
-# -----------------------------------------------------------
-# To search for the best setting, we can employ ``GridSearchCV``.
+# To search for the best setting, we employ ``GridSearchCV``.
 # The usage is similar to sklearn's except that the parameter ``scoring`` is not available. Please specify
 # ``scoring_metric`` in ``linear.MultiLabelEstimator`` instead.
-liblinear_options = ["-s 2 -c 0.5", "-s 2 -c 1", "-s 2 -c 2"]
+
+liblinear_options = ["-s 2 -c 0.5", "-s 2 -c 1", "-s 2 -c 2", "-s 1 -c 0.5", "-s 1 -c 1", "-s 1 -c 2"]
 parameters = {"clf__options": liblinear_options, "tfidf__max_features": [10000, 20000, 40000], "tfidf__min_df": [3, 5]}
 clf = linear.GridSearchCV(pipeline, parameters, cv=5, n_jobs=4, verbose=1)
 clf = clf.fit(datasets["train"]["x"], y)

 ######################################################################
-# Here we check the combinations of six feature generations and three regularization parameters
+# Here we check the combinations of six feature generation options and six liblinear options
 # in the linear classifier. The key in ``parameters`` should follow the sklearn's coding rule
 # starting with the estimator's alias and two underscores (i.e., ``clf__``).
 # We specify ``n_jobs=4`` to run four tasks in parallel.
-# After finishing gridsearch, we can get the best parameters by the following code:
+# After finishing the grid search, we can get the best parameters by the following code:

 for param_name in sorted(parameters.keys()):
     print(f"{param_name}: {clf.best_params_[param_name]}")

 ######################################################################
 # The best parameters are::
 #
-#     clf__options: '-s 2 -c 0.5 -m 1'
-#     tfidf__max_features: 20000
-#     tfidf__min_df: 3
+#     clf__options: -s 2 -c 0.5 -m 1
+#     tfidf__max_features: 10000
+#     tfidf__min_df: 5
 #
-# For testing, we also need to read in data first and format test labels into a 0/1 sparse matrix.
-
-y = binarizer.transform(datasets["test"]["y"]).astype("d").toarray()
-
-######################################################################
-# Applying the ``predict`` function of ``GridSearchCV`` object to use the
-# estimator trained under the best hyper-parameters for prediction.
+# Note that in the above code, the ``refit`` argument of ``GridSearchCV`` is enabled by default, meaning that the best configuration will be trained on the whole dataset after hyperparameter search.
+# We refer to this as the retrain strategy.
+# After fitting ``GridSearchCV``, the retrained model is stored in ``clf``.
+#
+# We can apply the ``predict`` function of the ``GridSearchCV`` object to use the estimator trained under the best hyperparameters for prediction.
 # Then use ``linear.compute_metrics`` to calculate the test performance.

+# For testing, we also need to read in data first and format test labels into a 0/1 sparse matrix.
+y = binarizer.transform(datasets["test"]["y"]).astype("d").toarray()
 preds = clf.predict(datasets["test"]["x"])
 metrics = linear.compute_metrics(
     preds,
@@ -114,4 +85,4 @@
 ######################################################################
 # The result of the best parameters will look similar to::
 #
-#    {'Macro-F1': 0.4965720851051106, 'Micro-F1': 0.8004678830627301, 'P@1': 0.9587412721675744, 'P@3': 0.8021469454453142, 'P@5': 0.5605401496291271}
+#    {'Macro-F1': 0.5296621774388927, 'Micro-F1': 0.8021279986938116, 'P@1': 0.9561621216872636, 'P@3': 0.7983185389507189, 'P@5': 0.5570921518306848}
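Only the closing parenthesis of the ``Pipeline`` construction falls inside this diff's context lines. For reference, a plausible sketch of the object being tuned, using the ``tfidf``/``clf`` aliases and the ``options``, ``linear_technique``, and ``scoring_metric`` arguments the comments name; the concrete values here are assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from libmultilabel import linear

    pipeline = Pipeline([
        # "tfidf" and "clf" are the aliases that clf__options,
        # tfidf__max_features, and tfidf__min_df refer to in the grid.
        ("tfidf", TfidfVectorizer(max_features=20000, min_df=3)),
        ("clf", linear.MultiLabelEstimator(
            options="-s 2 -m 4",
            linear_technique="1vsrest",
            scoring_metric="P@1",
        )),
    ])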

docs/examples/plot_linear_tree_tutorial.py

Lines changed: 16 additions & 8 deletions
@@ -1,14 +1,25 @@
 """
-Handling Data with Many Labels
-==============================
+Handling Data with Many Labels Using Linear Methods
+====================================================

 For the case that the amount of labels is very large,
 the training time of the standard ``train_1vsrest`` method may be unpleasantly long.
 The ``train_tree`` method in LibMultiLabel can vastly improve the training time on such data sets.

-To illustrate this speedup, we will use the `EUR-Lex dataset <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#EUR-Lex>`_,
-which contains 3,956 labels.
-In this example, the data is downloaded under the directory ``data/eur-lex``.
+To illustrate this speedup, we will use the `EUR-Lex dataset <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#EUR-Lex>`_, which contains 3,956 labels.
+The data in the following example is downloaded under the directory ``data/eur-lex``.
+
+Users can use the following command to easily apply the ``train_tree`` method.
+
+.. code-block:: bash
+
+    $ python3 main.py --training_file data/eur-lex/train.txt \
+                      --test_file data/eur-lex/test.txt \
+                      --linear \
+                      --linear_technique tree
+
+Besides CLI usage, users can also use the API to apply the ``train_tree`` method.
+Below is an example.
 """

 import math
@@ -88,6 +99,3 @@ def metrics_in_batches(model):
 print("Score of 1vsrest:", metrics_in_batches(ovr_model))
 print("Score of tree:", metrics_in_batches(tree_model))

-######################################################################
-#
-# .. bibliography::
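As a companion to the CLI command added in the docstring, a minimal API sketch, assuming EUR-Lex already sits under ``data/eur-lex`` and following the ``y``-then-``x`` argument order that ``train_1vsrest`` uses elsewhere in these docs:

    from libmultilabel import linear

    # Load and vectorize EUR-Lex with the default preprocessor settings.
    datasets = linear.load_dataset("txt", "data/eur-lex/train.txt", "data/eur-lex/test.txt")
    preprocessor = linear.Preprocessor()
    datasets = preprocessor.fit_transform(datasets)

    # train_tree groups the 3,956 labels into a tree so each node trains
    # on a subset of the data, which is where the speedup comes from.
    model = linear.train_tree(datasets["train"]["y"], datasets["train"]["x"])
    preds = linear.predict_values(model, datasets["test"]["x"])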
