-
Notifications
You must be signed in to change notification settings - Fork 38
Longitudinal and new qc_metrics #967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
These only make sense for categorical data, since for floats this will be not very meaningful.
Cool
Comment on categorical vs numeric above applies
Cool from above applies
All of this require knowledge on categorical and numerical variables - comment above applies. What do you think about having by default a qc metrics which does only compute the things without feature type information needed; and have e.g. an argument for qc metrics to be computed that is a list, and when someone wants the fancy stuff you suggest here that requires numeric/categorical distinction, they'd need to run infer_feature_types first? |
|
Can you also while resolving merge-conflicts move the ARRAY_TYPES variable to |
…to enhancement/issue-950 after new updates from the main
for more information, see https://pre-commit.ci
…rical & numerical vars, tests applied to both advanced=True and advanced=False
…to enhancement/issue-950
eroell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some intermediate comments - not sure you already asked Andreas to review or still refining things :)
| qc_vars: Collection[str] = (), | ||
| *, | ||
| layer: str | None = None, | ||
| advanced: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be made two arguments
observation_level and variable_level, which take lists of strings and by default the lists are what the current default is?
It would seem to me a bit more readable than "advanced"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, I'd try to keep the number of parameters as low as possible. Can we come up with design where these parameters do not exist? Like a scenario where some things would be skipped over unless the feature specs were calculated but it doesn't crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the function computes too many things at once where it adds even more complexity computing 8 metrics in some cases and 12 in another without changing the passed argument. What do you think? I don't have a strong opinion here
The way to address your design that I'd see would be to document well what will be computed if no feature types are found, and what will be computed additionally if they are found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah kinda like that. If possible, we should purge all and any parameters. Users rarely read API docs.
agerardy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't find any big issues, looks pretty good to me already. tests seem to cover everything as far as I understand it.
|
Eventually, I'd like to do a final review because there's a few things that I think need to be changed. Could you please update the PR description? |
for more information, see https://pre-commit.ci
…to enhancement/issue-950
eroell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is on a good path.
When all the current comments are addressed, the final review that @Zethson suggested can start. :)
| missing_cat = _compute_missing_values(mtx_cat, axis=1) | ||
| valid_counts = mtx_cat.shape[1] - missing_cat | ||
|
|
||
| elif original_mtx.ndim == 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this branching here and the for loop below be replaced by using _apply_over_time_axis?
|
|
||
|
|
||
| @pytest.fixture | ||
| def missing_values_edata_3d(obs_data, var_data_adv): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this fixture be used instead, to not create an additional test dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using this fixture directly, but it fails during EHRData construction so I'll fix it and use it that way
| assert np.array_equal(modification_copy.var[key], adata.var[key]) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the time, we care about tests for the public API.
You have added one above, which is great. From what I see, the test above tests all of what _compute_obs_metrics returns, and I think this test for the private function _compute_obs_metrics can be deleted.
| assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 0.9183])) | ||
|
|
||
|
|
||
| def test_obs_qc_metrics_3D(missing_values_edata_3d): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the comment that we'd like to test public API mainly:
It is neater to write a test alike test_qc_metrics_vanilla_advanced, for 3D.
With this, this test for _compute_obs_metrics becomes obsolete. (You can reuse the things here for this test of the public function)
Here, the test of the private function passes - but it misses that the public function is not 3D enabled because the decorator is accidentally still there ;)
| assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 1.0])) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already test for advanced above - I think this test for the private function is obsolete, too.
| assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 0.9183])) | ||
|
|
||
|
|
||
| def test_obs_qc_metrics_advanced_3D(missing_values_edata_3d): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be removed as well in favor of the suggested test for the public API
| assert np.allclose(obs_metrics["entropy_of_missingness"].values, np.array([0.9183, 1.0])) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("array_type", ARRAY_TYPES_NONNUMERIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything for _obs_qc_metrics applies to _var_qc_metrics here, too.
fixes #950
improved qc_metrics function with new metrics:
in
_compute_obs_metrics:unique_values_absunique_values_ratioentropy_of_missingnessin
_compute_var_metrics:unique_values_absunique_values_ratioentropy_of_missingnesscoefficient_of_variationis_constantconstant_variable_ratiorange_ratioupdated tests accordingly with new metrics
TODO