Skip to content

Conversation

@sueoglu
Copy link
Collaborator

@sueoglu sueoglu commented Oct 24, 2025

fixes #950

  • improved qc_metrics function with new metrics:

  • in _compute_obs_metrics :
    unique_values_abs
    unique_values_ratio
    entropy_of_missingness

  • in _compute_var_metrics :
    unique_values_abs
    unique_values_ratio
    entropy_of_missingness
    coefficient_of_variation
    is_constant
    constant_variable_ratio
    range_ratio

  • updated tests accordingly with new metrics

TODO

  • porting to 3D

@Zethson Zethson marked this pull request as draft October 24, 2025 13:30
@eroell
Copy link
Collaborator

eroell commented Nov 21, 2025

in _compute_obs_metrics :
unique_values_abs
unique_values_ratio

These only make sense for categorical data, since for floats this will be not very meaningful.
For ehrapy to know about categorical data, infer_feature_types must be called, and I think it would be nice to require this for as little functions as possible.

entropy_of_missingness

Cool

in _compute_var_metrics :
unique_values_abs
unique_values_ratio

Comment on categorical vs numeric above applies

entropy_of_missingness

Cool from above applies

coefficient_of_variation
is_constant
constant_variable_ratio
range_ratio
skewness
kurtosis

All of this require knowledge on categorical and numerical variables - comment above applies.

What do you think about having by default a qc metrics which does only compute the things without feature type information needed; and have e.g. an argument for qc metrics to be computed that is a list, and when someone wants the fancy stuff you suggest here that requires numeric/categorical distinction, they'd need to run infer_feature_types first?

@eroell
Copy link
Collaborator

eroell commented Nov 23, 2025

Can you also while resolving merge-conflicts move the ARRAY_TYPES variable to compat.py please? :)

@eroell eroell mentioned this pull request Nov 23, 2025
4 tasks
Copy link
Collaborator

@eroell eroell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some intermediate comments - not sure you already asked Andreas to review or still refining things :)

qc_vars: Collection[str] = (),
*,
layer: str | None = None,
advanced: bool = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be made two arguments
observation_level and variable_level, which take lists of strings and by default the lists are what the current default is?

It would seem to me a bit more readable than "advanced"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, I'd try to keep the number of parameters as low as possible. Can we come up with design where these parameters do not exist? Like a scenario where some things would be skipped over unless the feature specs were calculated but it doesn't crash.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the function computes too many things at once where it adds even more complexity computing 8 metrics in some cases and 12 in another without changing the passed argument. What do you think? I don't have a strong opinion here

The way to address your design that I'd see would be to document well what will be computed if no feature types are found, and what will be computed additionally if they are found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah kinda like that. If possible, we should purge all and any parameters. Users rarely read API docs.

@sueoglu sueoglu requested a review from agerardy November 26, 2025 17:52
Copy link
Collaborator

@agerardy agerardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't find any big issues, looks pretty good to me already. tests seem to cover everything as far as I understand it.

@Zethson
Copy link
Member

Zethson commented Nov 28, 2025

Eventually, I'd like to do a final review because there's a few things that I think need to be changed.

Could you please update the PR description?

Copy link
Collaborator

@eroell eroell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on a good path.

When all the current comments are addressed, the final review that @Zethson suggested can start. :)

missing_cat = _compute_missing_values(mtx_cat, axis=1)
valid_counts = mtx_cat.shape[1] - missing_cat

elif original_mtx.ndim == 3:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this branching here and the for loop below be replaced by using _apply_over_time_axis?



@pytest.fixture
def missing_values_edata_3d(obs_data, var_data_adv):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this fixture be used instead, to not create an additional test dataset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using this fixture directly, but it fails during EHRData construction so I'll fix it and use it that way

assert np.array_equal(modification_copy.var[key], adata.var[key])


@pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time, we care about tests for the public API.
You have added one above, which is great. From what I see, the test above tests all of what _compute_obs_metrics returns, and I think this test for the private function _compute_obs_metrics can be deleted.

assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 0.9183]))


def test_obs_qc_metrics_3D(missing_values_edata_3d):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the comment that we'd like to test public API mainly:

It is neater to write a test alike test_qc_metrics_vanilla_advanced, for 3D.

With this, this test for _compute_obs_metrics becomes obsolete. (You can reuse the things here for this test of the public function)

Here, the test of the private function passes - but it misses that the public function is not 3D enabled because the decorator is accidentally still there ;)

assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 1.0]))


@pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already test for advanced above - I think this test for the private function is obsolete, too.

assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 0.9183]))


def test_obs_qc_metrics_advanced_3D(missing_values_edata_3d):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed as well in favor of the suggested test for the public API

assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 1.0]))


@pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything for _obs_qc_metrics applies to _var_qc_metrics here, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Longitudinal qc_metrics

5 participants