Longitudinal and new qc_metrics #967

sueoglu · 2025-10-24T11:49:29Z

fixes #950

Improved the qc_metrics function with new metrics.

In _compute_obs_metrics:
- unique_values_abs
- unique_values_ratio
- entropy_of_missingness

In _compute_var_metrics:
- unique_values_abs
- unique_values_ratio
- entropy_of_missingness
- coefficient_of_variation
- is_constant
- constant_variable_ratio
- range_ratio

Updated the tests to cover the new metrics.
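
The description does not spell out how entropy_of_missingness is defined. The values asserted in the tests of this PR (0.9183 and 1.0) are consistent with the binary Shannon entropy, in bits, of a missing fraction of 1/3 and 1/2 respectively. Below is a minimal NumPy sketch under that assumption. The function name and the use of NaN as the missing-value marker are illustrative and not taken from ehrapy's implementation.

```python
import numpy as np


def entropy_of_missingness(mtx: np.ndarray, axis: int) -> np.ndarray:
    """Binary Shannon entropy (in bits) of the missing fraction along `axis`.

    Assumption: missing values are encoded as NaN.
    """
    p = np.isnan(mtx).mean(axis=axis)  # fraction of missing entries
    # 0 * log2(0) evaluates to NaN in floating point; by convention it is 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return np.nan_to_num(h, nan=0.0)


x = np.array([[1.0, np.nan, 3.0], [np.nan, 2.0, 4.0]])
per_obs = entropy_of_missingness(x, axis=1)  # one value per row
per_var = entropy_of_missingness(x, axis=0)  # one value per column
```

With one of three values missing per row, the per-observation entropy is about 0.9183. A column with no missing values has entropy 0.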

TODO

porting to 3D


ehrapy/preprocessing/_quality_control.py

eroell · 2025-11-21T21:01:30Z

in _compute_obs_metrics :
unique_values_abs
unique_values_ratio

These only make sense for categorical data; for floats they are not very meaningful.
For ehrapy to know which data is categorical, infer_feature_types must be called, and I think it would be nice to require this for as few functions as possible.

entropy_of_missingness

Cool

in _compute_var_metrics :
unique_values_abs
unique_values_ratio

The comment above on categorical vs. numeric data applies here too.

entropy_of_missingness

Cool from above applies

coefficient_of_variation
is_constant
constant_variable_ratio
range_ratio
skewness
kurtosis

All of these require knowing which variables are categorical and which are numerical, so the comment above applies.

What do you think about having qc_metrics compute, by default, only the metrics that need no feature type information, and adding an argument that takes a list of the metrics to compute? Anyone who wants the fancy metrics you suggest here, which require the numeric/categorical distinction, would then need to run infer_feature_types first.
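
A minimal sketch of this proposal, for discussion only. The helper name `resolve_metrics`, its parameters, and the grouping of metrics into two tuples are hypothetical; only the metric names come from this PR.

```python
# Hypothetical grouping of the metric names listed in this PR; not ehrapy's API.
TYPE_FREE_METRICS = ("entropy_of_missingness",)
TYPE_AWARE_METRICS = (
    "unique_values_abs",
    "unique_values_ratio",
    "coefficient_of_variation",
    "is_constant",
    "constant_variable_ratio",
    "range_ratio",
)


def resolve_metrics(requested=None, has_feature_types=False):
    """Return the metric names to compute, or raise if a prerequisite is missing."""
    if requested is None:
        # Default: only metrics that need no feature type information.
        return list(TYPE_FREE_METRICS)
    known = set(TYPE_FREE_METRICS) | set(TYPE_AWARE_METRICS)
    unknown = [m for m in requested if m not in known]
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    needs_types = [m for m in requested if m in TYPE_AWARE_METRICS]
    if needs_types and not has_feature_types:
        raise ValueError(f"Run infer_feature_types first; required by: {needs_types}")
    return list(requested)
```

With this shape, the default call never depends on infer_feature_types, and the error message names exactly which requested metrics triggered the requirement.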


eroell · 2025-11-23T15:14:47Z

While resolving the merge conflicts, can you also move the ARRAY_TYPES variable to compat.py, please? :)


eroell

Some intermediate comments - not sure whether you have already asked Andreas to review or are still refining things :)

eroell · 2025-11-26T17:46:45Z

ehrapy/preprocessing/_quality_control.py

    qc_vars: Collection[str] = (),
    *,
    layer: str | None = None,
+    advanced: bool = False,


Could this be split into two arguments, observation_level and variable_level, which take lists of strings and default to the metrics that are currently computed by default?

It would seem to me a bit more readable than "advanced"

If possible, I'd try to keep the number of parameters as low as possible. Can we come up with a design where these parameters do not exist? For example, some metrics would simply be skipped, without crashing, unless the feature specs have been calculated.

I wonder whether the function computes too many things at once. Computing 8 metrics in some cases and 12 in others, without any change to the passed arguments, adds even more complexity. What do you think? I don't have a strong opinion here.

The way I'd see to make your design work would be to document clearly what is computed when no feature types are found, and what is computed additionally when they are found.

Yeah, kind of like that. If possible, we should purge any and all parameters. Users rarely read API docs.
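
The parameter-free variant discussed in this thread could look roughly like the sketch below: type-free metrics are always computed, and type-aware ones are skipped without raising when no feature types are available. Everything here is an assumption for illustration: the function name, the `feature_types` argument (a per-variable list of "numeric"/"categorical" labels standing in for whatever infer_feature_types stores), the `n_missing` key, and the use of the population standard deviation for the coefficient of variation.

```python
import numpy as np


def compute_var_metrics(mtx, feature_types=None):
    """Per-variable metrics for a (n_obs, n_vars) float array with NaN as missing.

    Hypothetical sketch: `feature_types` is None when types were never inferred.
    """
    metrics = {"n_missing": np.isnan(mtx).sum(axis=0)}
    if feature_types is None:
        # Type-aware metrics are silently skipped; nothing crashes.
        return metrics

    numeric = np.array([t == "numeric" for t in feature_types])
    cv = np.full(mtx.shape[1], np.nan)  # stays NaN for non-numeric variables
    if numeric.any():
        cols = mtx[:, numeric]
        with np.errstate(divide="ignore", invalid="ignore"):
            cv[numeric] = np.nanstd(cols, axis=0) / np.nanmean(cols, axis=0)
    metrics["coefficient_of_variation"] = cv
    return metrics


x = np.array([[1.0, 0.0], [3.0, 1.0], [np.nan, 1.0]])
basic = compute_var_metrics(x)
full = compute_var_metrics(x, ["numeric", "categorical"])
```

The trade-off raised in the thread is visible here: the same call returns a different set of columns depending on prior state, which is why documenting both cases was suggested.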

ehrapy/preprocessing/_quality_control.py

agerardy

Didn't find any big issues, looks pretty good to me already. Tests seem to cover everything as far as I understand it.

ehrapy/preprocessing/_quality_control.py

Zethson · 2025-11-28T11:31:21Z

Eventually, I'd like to do a final review because there are a few things that I think need to be changed.

Could you please update the PR description?


eroell

This is on a good path.

When all the current comments are addressed, the final review that @Zethson suggested can start. :)

eroell · 2025-11-29T17:01:17Z

ehrapy/preprocessing/_quality_control.py

+                missing_cat = _compute_missing_values(mtx_cat, axis=1)
+                valid_counts = mtx_cat.shape[1] - missing_cat
+
+            elif original_mtx.ndim == 3:


Can this branching here and the for loop below be replaced by using _apply_over_time_axis?
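
The signature of _apply_over_time_axis is not shown in this thread, so the sketch below does not use it. It only illustrates a generic way to avoid an ndim branch for a per-observation missing count: flatten every trailing axis before reducing. The function name and the assumed 3D layout (n_obs, n_vars, n_time) are illustrative.

```python
import numpy as np


def missing_per_observation(mtx):
    """Count NaN entries per observation for 2D or 3D input without branching on ndim.

    Assumption: axis 0 is the observation axis; all other axes are flattened.
    """
    arr = np.asarray(mtx, dtype=float)
    flat = arr.reshape(arr.shape[0], -1)  # (n_obs, n_vars) or (n_obs, n_vars * n_time)
    return np.isnan(flat).sum(axis=1)


x2 = np.array([[1.0, np.nan], [2.0, 3.0]])
x3 = np.stack([x2, np.array([[np.nan, np.nan], [4.0, np.nan]])], axis=2)
counts_2d = missing_per_observation(x2)
counts_3d = missing_per_observation(x3)
```

Whether a count pooled over variables and time is the intended 3D semantics for these metrics is a design decision for the PR, not something this sketch settles.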

ehrapy/preprocessing/_quality_control.py

eroell · 2025-11-29T17:15:21Z

tests/conftest.py

+
+
+@pytest.fixture
+def missing_values_edata_3d(obs_data, var_data_adv):


Could this fixture be used instead, to not create an additional test dataset?

I tried using this fixture directly, but it fails during EHRData construction, so I'll fix it and use it that way.

tests/conftest.py

eroell · 2025-11-29T17:20:57Z

tests/preprocessing/test_quality_control.py

+        assert np.array_equal(modification_copy.var[key], adata.var[key])
+
+
 @pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC)


Most of the time, we care about tests for the public API.
You have added one above, which is great. From what I see, the test above tests all of what _compute_obs_metrics returns, and I think this test for the private function _compute_obs_metrics can be deleted.

eroell · 2025-11-29T17:24:40Z

tests/preprocessing/test_quality_control.py

+    assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 0.9183]))
+
+
+def test_obs_qc_metrics_3D(missing_values_edata_3d):


Following the comment that we'd like to test public API mainly:

It would be neater to write a test like test_qc_metrics_vanilla_advanced for 3D.

With that in place, this test for _compute_obs_metrics becomes obsolete. (You can reuse the things here for the test of the public function.)

Here, the test of the private function passes - but it misses that the public function is not 3D enabled because the decorator is accidentally still there ;)

eroell · 2025-11-29T17:25:39Z

tests/preprocessing/test_quality_control.py

+    assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 1.0]))
+
+
+@pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC)


You already test for advanced above - I think this test for the private function is obsolete, too.

eroell · 2025-11-29T17:26:20Z

tests/preprocessing/test_quality_control.py

+    assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 0.9183]))
+
+
+def test_obs_qc_metrics_advanced_3D(missing_values_edata_3d):


Can be removed as well in favor of the suggested test for the public API

eroell · 2025-11-29T17:26:50Z

tests/preprocessing/test_quality_control.py

+    assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 1.0]))


 @pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC)


Everything for _obs_qc_metrics applies to _var_qc_metrics here, too.

Öykü Süoglu added 2 commits October 24, 2025 13:42

qc_metrics methods improved with new metrics, tests and docstrings updated

b1de27d

skewness and kurtosis left out

dd3e67a

Zethson marked this pull request as draft October 24, 2025 13:30

Öykü Süoglu added 2 commits October 24, 2025 17:01

is_constant metric for categorical variables datatype set to boolean

149b6fe

small error fixed

e9d3fc6

eroell reviewed Nov 21, 2025


Zethson and others added 2 commits November 22, 2025 14:25

Merge branch 'main' into enhancement/issue-950

c8a1319

[pre-commit.ci] auto fixes from pre-commit.com hooks

de05373

for more information, see https://pre-commit.ci

eroell mentioned this pull request Nov 23, 2025

Simple Impute for timeseries #975

Merged

4 tasks

Öykü Süoglu and others added 10 commits November 24, 2025 18:44

qc_metrics methods improved with new metrics, tests and docstrings updated

e2fc04a

skewness and kurtosis left out

9e41a83

is_constant metric for categorical variables datatype set to boolean

dd2bee9

small error fixed

5e143c2

Merge branch 'enhancement/issue-950' of github.com:theislab/ehrapy into enhancement/issue-950 after new updates from the main

edaf8d9

[pre-commit.ci] auto fixes from pre-commit.com hooks

f5f36f1

for more information, see https://pre-commit.ci

new argument advanced added to qc_metrics, distinction between categorical & numerical vars, tests applied to both advanced=True and advanced=False

a95cfe9

array types moved from conftest.py to _compat.py

3839c2e

Merge branch 'enhancement/issue-950' of github.com:theislab/ehrapy into enhancement/issue-950

978a168

vanilla test advanced added and default fixed

5ba8e6b

eroell reviewed Nov 26, 2025


sueoglu requested a review from agerardy November 26, 2025 17:52

agerardy reviewed Nov 27, 2025


fixed small things considering the reviews

ddb8931

Öykü Süoglu and others added 4 commits November 28, 2025 12:50

_apply_over_time_axis decorator for the functions added

736d8bc

[pre-commit.ci] auto fixes from pre-commit.com hooks

cb3bc2a

for more information, see https://pre-commit.ci

forgot to add after pre-commit made changes

5d86fc0

Merge branch 'enhancement/issue-950' of github.com:theislab/ehrapy into enhancement/issue-950

2d6d1f3

Öykü Süoglu added 3 commits November 28, 2025 14:26

undo the decorator for the metric functions, since it doesnt work in this case

5ad274c

3d enabled qc metrics

f5e220d

tests for 3d qc_metrics

a8aff81

eroell reviewed Nov 29, 2025

