@crispy-wonton crispy-wonton commented Jun 9, 2025

Fixes #111

Summary

Compare the MCS and MCS-EPC joined output files from the daps and core processing pipelines in a notebook to identify any differences.

New files:

  • asf_core_data/analysis/compare_processing/compare_mcs_installations_processing.py - notebook to compare processing outputs

Please note: the code is very repetitive because of the notebook layout. I have left comments on the file on GitHub marking where code is repeated from a previous section, so you don't need to review the same code twice.

Instructions for reviewer:

To create the notebook, run the following commands:

pip install jupytext
jupytext --to notebook asf_core_data/analysis/compare_processing/compare_mcs_installations_processing.py

You can then run the notebook as normal from your chosen IDE.

Please pay special attention to:

  • The files being loaded, to confirm that we are comparing the correct datasets
  • The creation of unique IDs
  • The preprocessing applied to the last two comparison datasets (MCS-EPC full and MCS-EPC most relevant) before the row-by-row comparison
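
For reviewers who want a reference point for the unique ID and row-by-row steps, here is a minimal sketch of the general approach. It is not the notebook's code: the key columns (`uprn`, `commission_date`), the `capacity` column, and the helper names `add_unique_id` and `compare_rows` are all hypothetical, chosen only for illustration. It relies on pandas' `DataFrame.compare`, which requires both frames to have identical row and column labels, so the frames are first restricted to their shared IDs and columns.

```python
from typing import Sequence

import pandas as pd

# Hypothetical key columns; the notebook may build its IDs from different fields.
KEY_COLS = ["uprn", "commission_date"]


def add_unique_id(df: pd.DataFrame, key_cols: Sequence[str]) -> pd.DataFrame:
    """Return a copy of df with a 'unique_id' column built from key_cols."""
    out = df.copy()
    out["unique_id"] = out[list(key_cols)].astype(str).agg("_".join, axis=1)
    # An ID is only usable for alignment if it identifies exactly one row.
    if out["unique_id"].duplicated().any():
        raise ValueError("unique_id is not unique; add more key columns")
    return out


def compare_rows(core_df: pd.DataFrame, daps_df: pd.DataFrame, key_cols: Sequence[str]):
    """Align two frames on unique_id and report where they differ."""
    core = add_unique_id(core_df, key_cols).set_index("unique_id")
    daps = add_unique_id(daps_df, key_cols).set_index("unique_id")

    # Rows present in only one pipeline's output.
    only_core = core.index.difference(daps.index)
    only_daps = daps.index.difference(core.index)

    # Compare cell by cell on the shared rows and shared columns.
    shared = core.index.intersection(daps.index)
    cols = core.columns.intersection(daps.columns)
    diffs = core.loc[shared, cols].compare(daps.loc[shared, cols])
    return only_core, only_daps, diffs


# Toy data standing in for the core and daps outputs.
core_example = pd.DataFrame(
    {
        "uprn": [1, 2, 3],
        "commission_date": ["2020-01-01", "2020-02-01", "2020-03-01"],
        "capacity": [5.0, 6.0, 7.0],
    }
)
daps_example = pd.DataFrame(
    {
        "uprn": [1, 2, 4],
        "commission_date": ["2020-01-01", "2020-02-01", "2020-04-01"],
        "capacity": [5.0, 6.5, 8.0],
    }
)

only_core, only_daps, diffs = compare_rows(core_example, daps_example, KEY_COLS)
```

In this toy example, `only_core` and `only_daps` each hold one ID, and `diffs` has a single row showing the `capacity` mismatch under the `self` (core) and `other` (daps) column labels. Raising on duplicate IDs is a deliberate choice in the sketch: a non-unique ID would silently misalign rows and make the comparison unreliable.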

Is there anything missing that you think would be good to investigate?

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained the feature in this PR or (better) in output/reports/
  • I have requested a code review

raw_daps_df = pd.read_parquet(daps_epc_path)

# %%
# Preprocess datasets to make them comparable

The processing below is exactly the same as for the MCS-EPC full dataset.

Comment on lines +86 to +91
# %%
# Compare with y-data profiling
core_report = ProfileReport(core_df, title=f"Core {dataset.upper()}", minimal=True)
daps_report = ProfileReport(daps_df, title=f"Daps {dataset.upper()}", minimal=True)
comparison_report = core_report.compare(daps_report)
comparison_report.to_file(f"{dataset}_comparison.html")

These lines are identical to lines 30-33.

@crispy-wonton crispy-wonton requested a review from sofiapinto June 9, 2025 17:02
Successfully merging this pull request may close these issues.

Write notebook to compare different MCS processing versions