Skip to content

Conversation

turbomam
Copy link
Member

@turbomam turbomam commented Aug 11, 2025

TO DO

One option for dealing with legacy data using units that we just don't want to approve of: a bertron-style key/value slot

DONE

  • add annotations.storage_units to slots where MIxS specifies a single preferred unit
  • add annotations.storage_units to slots where MIxS specifies multiple preferred units
  • write python tests for any unexpected annotation keys (hard-coded list)
  • write a python test ensuring that all of the annotations.storage_units values are in UnitEnum
  • write a python test that goes through src/data/valid/ and checks the asserted has_unit values against the annotations.storage_units value(s) for that the owning slot
  • demonstrate using titles relating a display unit like 'years' for cryptic UCUM units like a. THis PR is not thorough on that task yet.
  • create titles for units whose UCUM representatiosn aren't intuitive.

WON'T DO

  • consider implementing the slot/units association with range_expression instead of annotations. The implementation of this would be exploratory for any LInkML project, not just our units test case.

@Copilot Copilot AI review requested due to automatic review settings August 11, 2025 15:30
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds storage_units annotations to hundreds of measurement slots to provide unambiguous unit specifications for data storage. The changes ensure that slots with clearly defined measurement units have consistent UCUM-compatible storage unit annotations.

  • Adds storage_units annotations to 115+ measurement slots across the schema with unambiguous unit requirements
  • Updates core schema slots for nitrogen measurements with explicit storage unit specifications
  • Documents the systematic approach for adding storage units to slots with single, uncontested preferred units

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/schema/mixs.yaml Adds storage_units annotations to measurement slots with clear unit requirements
src/schema/core.yaml Adds storage_unit annotations to ammonium_nitrogen, nitrate_nitrogen, and nitrite_nitrogen slots
assets/yq-for-mixs_subset_modified.txt Documents yq commands for systematically adding storage_units annotations
assets/pref-unit-slots-claude-ucum-expanded-curated-with-has-problem.tsv Reference data showing unit analysis and problem identification

Copy link

github-actions bot commented Aug 11, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-2599/

Built to branch gh-pages at 2025-09-03 16:31 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@turbomam
Copy link
Member Author

turbomam commented Aug 11, 2025

This yq command reports all of the annotations that have been used in the schema. I used it to construct ALLOWED_ANNOTATION_KEYS in tests/test_annotations.py

The query results are in the next comment box.

No further action is required.

shell
yq '.. | select(has("annotations")) as $item | $item.annotations | to_entries[] | [(($item | path | join(".")), .key, .value.tag // "", .value.value // .value // "")]| @tsv' nmdc_schema/nmdc_materialized_patterns.yaml | sort | uniq > annotations.tsv

@turbomam
Copy link
Member Author

turbomam commented Aug 11, 2025

See the yq query above.

This list of historical annotations was used to construct ALLOWED_ANNOTATION_KEYS in tests/test_annotations.py

No further action is required.

key classes enums slots Total Result
expected_value 4   496 500
file_name_pattern   57   57
occurrence     465 465
originally   15   15
preferred_unit 1   224 225
source     2 2
storage_units     202 202
tooltip 11   7 18
Total Result 16 72 1396 1484

the class expected_values and preferred_units may all come from slot_usages

the source keys are annotations of the tooltip annotations!

@turbomam
Copy link
Member Author

turbomam commented Aug 11, 2025

slots with a MIxS preferred_unit annotation but no storage_units annotation yet

yq '.slots | to_entries | map(select(.value.annotations.preferred_unit and (.value.annotations.storage_units == null))) | map(.key)' nmdc_schema/nmdc_materialized_patterns.yaml
- air_temp_regm
- api
- ceil_thermal_mass
- down_par
- efficiency_percent
- floor_thermal_mass
- inside_lux
- light_regm
- org_count_qpcr_info
- organism_count
- permeability
- rel_humidity_out
- samp_transport_cond
- solar_irradiance
- specific_humidity
- turbidity
- viscosity
- wall_thermal_mass

@turbomam
Copy link
Member Author

turbomam commented Aug 11, 2025

  • storage_units that may not be intuitive and might require a title assertion, to be used as the display value.
  1. a (line 356) - for years - This is indeed surprising! Using a (annum) instead of the more intuitive y or year
  2. "[g]" (line 770) - Standard gravity/g-force - the square brackets notation is surprising
  3. "[lb_av]" (line 724) - Pound (avoirdupois) with strange bracket notation
  4. "[kn_i]" (line 841) - Knot (nautical miles per hour) with the _i suffix
  5. "[sft_i]" (line 795) - Square feet with the _i suffix
  6. "[cft_i]" (line 800) - Cubic feet with the _i suffix
  7. "[in_i]" (line 850) - Inch with the _i suffix
  8. ratio (line 686) - Just "ratio" as a unit, which seems oddly generic
  9. RAD (line 791) - Radiation absorbed dose (different from rad for radian)
  10. NTU (line 718) - Nephelometric Turbidity Units - very domain-specific

@turbomam turbomam changed the title unambiguous, uncontested storage_units annotations Connecting units to slots with annotations Aug 12, 2025
@@ -404,30 +414,46 @@ enums:
description: The Unified Code for Units of Measure (UCUM) representation of atmosphere.
mV:
description: The Unified Code for Units of Measure (UCUM) representation of millivolt.
PSU:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should removing this PV have a migrator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And all other PV changes.

@pkalita-lbl pkalita-lbl marked this pull request as draft August 19, 2025 18:48
@pkalita-lbl
Copy link
Collaborator

@samobermiller to help write a migrator for this

@samobermiller
Copy link
Contributor

@samobermiller to help write a migrator for this

waiting to work on this until additional schema changes mentioned in slack channel by @turbomam can be discussed at next meeting

@turbomam
Copy link
Member Author

turbomam commented Sep 2, 2025

turbomam and others added 4 commits September 2, 2025 13:12
…tyvalue-slots-that-only-have-one-clear-cut-mixs---ucum-mapping
Cleaned up test data file by removing obsolete JGI DNA properties that were
causing validation failures after recent schema changes:
- dna_cont_well: C2
- dna_absorb2: 2.02
- dna_absorb1: 2.02

These properties were removed from the schema in commits 2558ab9
"remove jgi only slots" but remained in the test data causing validation errors.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Recent schema changes added storage_units annotations to multiple slots:
- duration
- final_concentration
- input_mass
- lbc_thirty
- lbceq
- mass
- source_concentration
- subsurface_depth
- temperature
- volume

Updated test data files to comply with new storage_units constraints:
- lbc_thirty and lbceq: changed from mg/kg to [ppm] in both
  Biosample-possibly-exhaustive.yaml and Database-interleaved.yaml
- final_concentration: changed from ug/uL to umol/L in
  Extraction-proteomics.yaml and ChemicalConversionProcess-digest.yaml

Test data for other annotated slots (duration, input_mass, mass,
source_concentration, subsurface_depth, temperature, volume) was
already compliant with the new storage_units constraints.

Resolves unit validation test failures and ensures all test data
aligns with the storage_units feature implementation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…idation

This commit implements the storage_units feature for slots with clear MIXS→UCUM mappings:

Schema changes:
- Add storage_units annotations to 16 QuantityValue slots in core.yaml:
  * zinc, manganese, ammonium_nitrogen, nitrate_nitrogen, nitrite_nitrogen: [ppm]|mg/kg|mg/L
  * lbc_thirty, lbceq: [ppm]
  * subsurface_depth: m
  * mass: g
  * final_concentration: %|mmol/L|umol/L
  * source_concentration: %|mmol/L
  * duration: h|min
  * temperature: Cel
  * volume: mL|uL
  * bulk_elect_conductivity: mS/cm
  * input_mass: g

Test data fixes:
- Fix unit validation errors caused by new storage_units constraints:
  * lbc_thirty/lbceq: changed from mg/kg to [ppm]
  * final_concentration: changed from ug/uL to umol/L
  * photon_flux: changed from umol/s/m2 to umol/m2/s
  * tot_org_carb: changed from % to mg/L

All tests now pass (25/25) with proper unit validation enforcement.
@turbomam
Copy link
Member Author

turbomam commented Sep 2, 2025

I don't think it's really possible to assert a unit for compound slots like these

  • air_temp_regm
  • light_regm

So I have added a new annotation not_measurement_like

yq '.slots | to_entries | map(select(.value.annotations.preferred_unit and (.value.annotations.storage_units == null) and (.value.annotations.not_measurement_like == null))) | map(.key)' nmdc_schema/nmdc_materialized_patterns.yaml | sort
  • api
  • efficiency_percent
  • inside_lux
  • permeability
  • rel_humidity_out
  • specific_humidity

That yq command has been implemented as tests/test_units_alignment.py

@turbomam
Copy link
Member Author

turbomam commented Sep 2, 2025

🔴 Major MIxS Specification Problems:

  1. efficiency_percent - CRITICAL MISMATCH:
  • MIxS preferred unit: "micromole per liter"
  • Problem: Name says "percent" but unit is a concentration
  • no NMDC data so far
  1. inside_lux - NAME/UNIT MISMATCH:
  • MIxS preferred unit: "kilowatt per square metre"
  • Problem: "lux" suggests illuminance but unit is power density
  • no NMDC data so far
  1. Humidity slots - AMBIGUOUS PHRASING:
  • rel_humidity_out: "gram of air, kilogram of air"
    • no NMDC data so far
  • specific_humidity: "gram of air, kilogram of air"
    • no NMDC data so far
  • abs_air_humidity: "gram per gram, kilogram per kilogram, kilogram, pound"
  • Problem: Missing "per" makes it unclear (should be "gram per kilogram of air")

🟡 Database vs MIxS Discrepancies:

  1. abs_air_humidity:
  • MIxS units: "gram per gram, kilogram per kilogram, kilogram, pound"
  • Database units: "kPa"
  • SURPRISE: kPa is pressure, not humidity - suggests major data quality issue
  1. diss_oxygen:
  • MIxS units: "micromole per kilogram, milligram per liter"
  • Database units: "mL/L"
  • SURPRISE: mL/L (volume ratio) is unusual for dissolved oxygen concentration
  1. solar_irradiance:
  • MIxS units: "kilowatts per square meter per day, ergs per square centimeter per second"
  • Database units: "W/m2"
  • MODERATE SURPRISE: W/m2 is reasonable but not in the preferred list... and it doesn't have a time dimension

✅ MIxS specifies preferred_units but they can't be aligned with UCUM

No NMDC for either of these yet

  1. api: "degrees API" ✅ (petroleum industry standard)
  2. permeability: "mD" ✅ (millidarcy, standard permeability unit)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add storage_units annotations to QuantityValue slots that only have one clear-cut MIxS -> UCUM mapping
5 participants