doc/contributing.rst (1 addition & 1 deletion)
@@ -66,7 +66,7 @@ The RSMTool codebase enforces a certain code style via pre-commit checks and thi
Rather than doing this grouping and sorting manually, we use the `isort <https://pycqa.github.io/isort/>`_ pre-commit hook to achieve this.
- #. All classes, functions, and methods in the main code files have `numpy-formatted docstrings <https://numpydoc.readthedocs.io/en/latest/format.html>`_ that comply with `PEP 257 <https://www.python.org/dev/peps/pep-0257/>`_. This is enforced via the `pydocstyle <http://www.pydocstyle.org/en/stable/>`_ pre-commit check. Additionally, when writing docstrings, make sure to use the appropriate quotes when referring to argument names vs. argument values. As an example, consider the docstring for the `train_skll_model <https://rsmtool.readthedocs.io/en/main/api.html#rsmtool.modeler.Modeler.train_skll_model>`_ method of the ``rsmtool.modeler.Modeler`` class. Note that string argument values are enclosed in double quotes (e.g., "csv", "neg_mean_squared_error") whereas values of other built-in types are written as literals (e.g., ``True``, ``False``, ``None``). Note also that if one had to refer to an argument name in the docstring, this referent should be written as a literal. In general, we strongly encourage looking at the docstrings in the existing code to make sure that new docstrings follow the same practices.
+ #. All classes, functions, and methods in the main code files have `numpy-formatted docstrings <https://numpydoc.readthedocs.io/en/latest/format.html>`_ that comply with `PEP 257 <https://peps.python.org/pep-0257/>`_. This is enforced via the `pydocstyle <http://www.pydocstyle.org/en/stable/>`_ pre-commit check. Additionally, when writing docstrings, make sure to use the appropriate quotes when referring to argument names vs. argument values. As an example, consider the docstring for the `train_skll_model <https://rsmtool.readthedocs.io/en/main/api.html#rsmtool.modeler.Modeler.train_skll_model>`_ method of the ``rsmtool.modeler.Modeler`` class. Note that string argument values are enclosed in double quotes (e.g., "csv", "neg_mean_squared_error") whereas values of other built-in types are written as literals (e.g., ``True``, ``False``, ``None``). Note also that if one had to refer to an argument name in the docstring, this referent should be written as a literal. In general, we strongly encourage looking at the docstrings in the existing code to make sure that new docstrings follow the same practices.
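
To make the quoting convention above concrete, here is a minimal numpy-style docstring for a hypothetical helper function (the function and its parameters are invented for illustration and are not part of the RSMTool codebase): string values appear in double quotes, while other built-in values and argument names appear as literals.

.. code-block:: python

    def save_frame(df, file_format="csv", index=False):
        """
        Save a data frame to disk.

        Parameters
        ----------
        df : pandas.DataFrame
            The data frame to save.
        file_format : str, optional
            The output format. One of "csv", "tsv", or "xlsx".
            Defaults to "csv".
        index : bool, optional
            Whether to also write out the row index; note that ``index``
            here refers to the argument name and is therefore written
            as a literal.
            Defaults to ``False``.
        """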
doc/evaluation.rst (7 additions & 7 deletions)
@@ -110,7 +110,7 @@ Note that in this case the variances and covariance are computed by dividing by
QWK is computed using :ref:`rsmtool.utils.quadratic_weighted_kappa<qwk_api>` with ``ddof`` set to ``0``.
- See `Haberman (2019) <https://onlinelibrary.wiley.com/doi/abs/10.1002/ets2.12258>`_ for the full derivation of this formula. The discrete case is simply treated as a special case of the continuous one.
+ See `Haberman (2019) <https://doi.org/10.1002/ets2.12258>`_ for the full derivation of this formula. The discrete case is simply treated as a special case of the continuous one.
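
For readers following along with the API, here is a minimal sketch of how the QWK helper referenced in this hunk might be called; the import path and argument names below are assumptions based on the ``rsmtool.utils.quadratic_weighted_kappa`` reference and may differ across RSMTool versions.

.. code-block:: python

    import numpy as np

    # assumed import path; some versions expose this function directly
    # under ``rsmtool.utils`` instead
    from rsmtool.utils.metrics import quadratic_weighted_kappa

    human_scores = np.array([1, 2, 3, 4, 3, 2])
    system_scores = np.array([1.2, 2.1, 2.8, 3.9, 3.2, 2.4])

    # ddof=0 matches the population-style variance and covariance
    # computation described in this section
    qwk = quadratic_weighted_kappa(human_scores, system_scores, ddof=0)
    print(round(qwk, 3))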
.. note::
@@ -149,7 +149,7 @@ SMD between system and human scores is computed using :ref:`rsmtool.utils.standa
.. note::
- In RSMTool v6.x and earlier, SMD was computed with the ``method`` argument set to ``"williamson"`` as described in `Williamson et al. (2012) <https://onlinelibrary.wiley.com/doi/full/10.1111/j.1745-3992.2011.00223.x>`_. The values computed by RSMTool starting with v7.0 will be *different* from those computed by earlier versions.
+ In RSMTool v6.x and earlier, SMD was computed with the ``method`` argument set to ``"williamson"`` as described in `Williamson et al. (2012) <https://eric.ed.gov/?id=EJ959585>`_. The values computed by RSMTool starting with v7.0 will be *different* from those computed by earlier versions.
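
As a sketch of what the note above means for API users, pre-v7.0 SMD values can presumably be reproduced by passing ``method="williamson"`` explicitly; the import path and argument order below are assumptions and should be checked against the API documentation for your RSMTool version.

.. code-block:: python

    import numpy as np

    # assumed import path; some versions expose this function directly
    # under ``rsmtool.utils`` instead
    from rsmtool.utils.metrics import standardized_mean_difference

    human_scores = np.array([2, 3, 4, 3, 2, 4])
    system_scores = np.array([2.4, 2.9, 3.7, 3.3, 2.1, 3.8])

    # reproduce the pre-v7.0 behavior described in the note above by
    # setting the ``method`` argument explicitly
    smd_legacy = standardized_mean_difference(
        human_scores, system_scores, method="williamson"
    )
    print(smd_legacy)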
.. _mse:
@@ -179,7 +179,7 @@ Accuracy Metrics (True score)
According to test theory, an observed score is a combination of the true score :math:`T` and a measurement error. The true score cannot be observed, but its distribution parameters can be estimated from observed scores. Such an estimation requires that two human scores be available for *at least a* subset of responses in the evaluation set since these are necessary to estimate the measurement error component.
- Evaluating the system against the true score produces performance estimates that are robust to errors in human scores and remain stable even when human-human agreement varies (see `Loukina et al. (2020) <https://www.aclweb.org/anthology/2020.bea-1.2/>`_).
+ Evaluating the system against the true score produces performance estimates that are robust to errors in human scores and remain stable even when human-human agreement varies (see `Loukina et al. (2020) <https://aclanthology.org/2020.bea-1.2/>`_).
The true score evaluations computed by RSMTool are available in the :ref:`intermediate file<rsmtool_true_score_eval>` ``true_score_eval``.
@@ -208,7 +208,7 @@ and :math:`\sigma_T^2` is estimated as:
- The PRMSE formula implemented in RSMTool is more general and can also handle the case where the number of available ratings varies across the responses (e.g. **only a subset of responses is double-scored**). While ``rsmtool`` and ``rsmeval`` only support evaluations with two raters, the implementation of the PRMSE formula available via the :ref:`API<prmse_api>` supports cases where some of the responses have **more than two** ratings available. The formula was derived by Matt S. Johnson and is explained in more detail in `Loukina et al. (2020) <https://www.aclweb.org/anthology/2020.bea-1.2/>`_.
+ The PRMSE formula implemented in RSMTool is more general and can also handle the case where the number of available ratings varies across the responses (e.g. **only a subset of responses is double-scored**). While ``rsmtool`` and ``rsmeval`` only support evaluations with two raters, the implementation of the PRMSE formula available via the :ref:`API<prmse_api>` supports cases where some of the responses have **more than two** ratings available. The formula was derived by Matt S. Johnson and is explained in more detail in `Loukina et al. (2020) <https://aclanthology.org/2020.bea-1.2/>`_.
In this case, the variance of rater errors is computed as a pooled variance estimator.
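
A rough sketch of how the more general PRMSE computation described above might be invoked through the API when only a subset of responses is double-scored; the function name, import path, and exact signature are assumptions based on the API reference and may differ across RSMTool versions.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # assumed import path; check the PRMSE section of the API documentation
    from rsmtool.utils.prmse import prmse_true

    system_scores = pd.Series([2.1, 3.4, 1.8, 3.9])

    # the second human rating is only available for some responses
    human_scores = pd.DataFrame(
        {"h1": [2, 3, 2, 4], "h2": [3, np.nan, 2, np.nan]}
    )

    # the variance of human rater errors is estimated internally, pooled
    # across the responses that have more than one rating
    print(prmse_true(system_scores, human_scores))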
@@ -262,7 +262,7 @@ In some cases, it may be appropriate to compute variance of human errors using a
Fairness
~~~~~~~~
- Fairness of automated scores is an important component of RSMTool evaluations (see `Madnani et al., 2017 <https://www.aclweb.org/anthology/W17-1605/>`_).
+ Fairness of automated scores is an important component of RSMTool evaluations (see `Madnani et al., 2017 <https://aclanthology.org/W17-1605/>`_).
When defining an experiment, the RSMTool user has the option of specifying which subgroups should be considered for such evaluations using the :ref:`subgroups<subgroups_rsmtool>` field. These subgroups are then used in all fairness evaluations.
@@ -308,7 +308,7 @@ DSM is computed using :ref:`rsmtool.utils.difference_of_standardized_means<dsm_a
Additional fairness evaluations
+++++++++++++++++++++++++++++++
- Starting with v7.0, RSMTool includes additional fairness analyses suggested in `Loukina, Madnani, & Zechner, 2019 <https://www.aclweb.org/anthology/W19-4401/>`_. The computed metrics from these analyses are available in :ref:`intermediate files<rsmtool_fairness_eval>` ``fairness_metrics_by_<SUBGROUP>``.
+ Starting with v7.0, RSMTool includes additional fairness analyses suggested in `Loukina, Madnani, & Zechner, 2019 <https://aclanthology.org/W19-4401/>`_. The computed metrics from these analyses are available in :ref:`intermediate files<rsmtool_fairness_eval>` ``fairness_metrics_by_<SUBGROUP>``.
These include:
@@ -372,4 +372,4 @@ Therefore, SMD between two human scores is computed using :ref:`rsmtool.utils.st
.. note::
- In RSMTool v6.x and earlier, SMD was computed with the ``method`` argument set to ``"williamson"`` as described in `Williamson et al. (2012) <https://onlinelibrary.wiley.com/doi/full/10.1111/j.1745-3992.2011.00223.x>`_. Starting with v7.0, the values computed by RSMTool will be *different* from those computed by earlier versions.
+ In RSMTool v6.x and earlier, SMD was computed with the ``method`` argument set to ``"williamson"`` as described in `Williamson et al. (2012) <https://eric.ed.gov/?id=EJ959585>`_. Starting with v7.0, the values computed by RSMTool will be *different* from those computed by earlier versions.
- Automated scoring of written and spoken responses is a growing field in educational natural language processing. Automated scoring engines employ machine learning models to predict scores for such responses based on features extracted from the text/audio of these responses. Examples of automated scoring engines include `MI Write <https://measurementinc.com/miwrite>`_ for written responses and `SpeechRater <https://www.ets.org/research/policy_research_reports/publications/report/2008/hukv>`_ for spoken responses.
+ Automated scoring of written and spoken responses is a growing field in educational natural language processing. Automated scoring engines employ machine learning models to predict scores for such responses based on features extracted from the text/audio of these responses. Examples of automated scoring engines include `MI Write <https://measurementinc.com/miwrite>`_ for written responses and `SpeechRater <https://www.ets.org/research/policy_research_reports/publications/report/2008/hukv.html>`_ for spoken responses.
RSMTool is a python package which automates and combines in a *single* :doc:`pipeline <pipeline>` multiple analyses that are commonly conducted when building and evaluating automated scoring models. The output of RSMTool is a comprehensive, customizable HTML statistical report that contains the outputs of these multiple analyses. While RSMTool does make it really simple to run this set of standard analyses using a single command, it is also fully customizable and allows users to easily exclude unneeded analyses, modify the standard analyses, and even include custom analyses in the report.
doc/intermediate_files_rsmeval.rst.inc (1 addition & 1 deletion)
@@ -136,7 +136,7 @@ Evaluations based on test theory
Additional fairness analyses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- These files contain the results of additional fairness analyses suggested in `Loukina, Madnani, & Zechner, 2019 <https://www.aclweb.org/anthology/W19-4401/>`_.
+ These files contain the results of additional fairness analyses suggested in `Loukina, Madnani, & Zechner, 2019 <https://aclanthology.org/W19-4401/>`_.
- ``<METRICS>_by_<SUBGROUP>.ols``: a serialized object of type ``pandas.stats.ols.OLS`` containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the :ref:`configuration file<subgroups_eval>`. The metrics are ``osa`` (overall score accuracy), ``osd`` (overall score difference), and ``csd`` (conditional score difference).
doc/intermediate_files_rsmtool.rst.inc (1 addition & 1 deletion)
@@ -260,7 +260,7 @@ Evaluations based on test theory
Additional fairness analyses
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- These files contain the results of additional fairness analyses suggested in `Loukina, Madnani, & Zechner, 2019 <https://www.aclweb.org/anthology/W19-4401/>`_.
+ These files contain the results of additional fairness analyses suggested in `Loukina, Madnani, & Zechner, 2019 <https://aclanthology.org/W19-4401/>`_.
- ``<METRICS>_by_<SUBGROUP>.ols``: a serialized object of type ``pandas.stats.ols.OLS`` containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the :ref:`configuration file<subgroups_rsmtool>`. The metrics are ``osa`` (overall score accuracy), ``osd`` (overall score difference), and ``csd`` (conditional score difference).
doc/internal/release_process.rst (1 addition & 1 deletion)
@@ -34,7 +34,7 @@ This process is only meant for the project administrators, not users and develop
#. Build the PyPI source and wheel distributions using ``python setup.py sdist build`` and ``python setup.py bdist_wheel build`` respectively. Note that you should delete the ``build`` directory after running the ``sdist`` command and before running the ``bdist_wheel`` command.
- #. Upload the source and wheel distributions to TestPyPI using ``twine upload --repository testpypi dist/*``. You will need to have the ``twine`` package installed and set up your ``$HOME/.pypirc`` correctly. See details `here <https://packaging.python.org/guides/using-testpypi/>`__. You will need to have the appropriate permissions for the ``ets`` organization on TestPyPI.
+ #. Upload the source and wheel distributions to TestPyPI using ``twine upload --repository testpypi dist/*``. You will need to have the ``twine`` package installed and set up your ``$HOME/.pypirc`` correctly. See details `here <https://packaging.python.org/en/latest/guides/using-testpypi/>`__. You will need to have the appropriate permissions for the ``ets`` organization on TestPyPI.
doc/who.rst (1 addition & 1 deletion)
@@ -5,7 +5,7 @@ Who is RSMTool for?
We expect the primary users of RSMTool to be researchers working on developing new automated scoring engines or on improving existing ones. Here's the most common scenario.
- A group of researchers already *has* a set of responses such as essays or recorded spoken responses which have already been assigned numeric scores by human graders. They have also processed these responses and extracted a set of (numeric) features using systems such as `Coh-Metrix <http://cohmetrix.com/>`_, `TextEvaluator <https://textevaluator.ets.org/TextEvaluator/>`_, `OpenSmile <https://www.audeering.com/research/opensmile/>`_, or using their own custom text/speech processing pipeline. They wish to understand how well the set of chosen features can predict the human score.
+ A group of researchers already *has* a set of responses such as essays or recorded spoken responses which have already been assigned numeric scores by human graders. They have also processed these responses and extracted a set of (numeric) features using systems such as `Coh-Metrix <https://soletlab.asu.edu/coh-metrix/>`_, `TextEvaluator <https://textevaluator.ets.org/TextEvaluator/>`_, `OpenSmile <https://www.audeering.com/research/opensmile/>`_, or using their own custom text/speech processing pipeline. They wish to understand how well the set of chosen features can predict the human score.
They can then run an RSMTool "experiment" to build a regression-based scoring model (using one of many available regressors) and produce a report. The report includes descriptive statistics for all their features, diagnostic information about the trained regression model, and a comprehensive evaluation of model performance on a held-out set of responses.
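
To make this scenario concrete, here is a minimal sketch of such an experiment run through the Python API; the file and column names are invented, the configuration fields shown are common ones, and the exact ``run_experiment`` signature should be checked against the API documentation for your RSMTool version.

.. code-block:: python

    from rsmtool import run_experiment  # assumed top-level import

    # hypothetical CSV files containing one row per response, with an ID
    # column, a human score column, and the extracted numeric features
    config = {
        "experiment_id": "toy_scoring_experiment",
        "description": "Predict human scores from extracted features",
        "model": "LinearRegression",
        "train_file": "train_features.csv",
        "test_file": "test_features.csv",
        "id_column": "response_id",
        "train_label_column": "human_score",
        "test_label_column": "human_score",
    }

    # writes the HTML report and intermediate files under "toy_output";
    # the same experiment can also be run from the command line with a
    # JSON configuration file (``rsmtool config.json``)
    run_experiment(config, "toy_output")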