
Commit df4f7eb

Merge branch 'main' into finalize_deprecation_filetypeenum_permissible_values
2 parents 5ebcc44 + 0fb8d0b commit df4f7eb

27 files changed, +324 -1337 lines

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions
@@ -123,6 +123,8 @@ how to deploy a test version of the schema documentation. This requires some bas

- Anyone who is involved in writing migrations or otherwise checking data from MongoDB against the schema should be comfortable running `make make-rdf`.
- The main [Makefile](Makefile) should in general not be edited. Instead, edits should be made to [project.Makefile](project.Makefile) (advanced contributors only).

> Advanced testing instructions for migrators can be found [here](nmdc_schema/migrators/README.md).

### Recording Decisions

- Use the [NMDC ADR Log](https://github.com/microbiomedata/NMDC_documentation/tree/main/decisions)

DEVELOPMENT.md

Lines changed: 2 additions & 0 deletions
@@ -140,3 +140,5 @@ Here's a one-liner you can use to derive release artifacts (which are [stored in

```shell
docker compose run --rm -it --name nmdc-schema-builder app sh -c 'poetry install && make squeaky-clean all test'
```

> Advanced testing instructions for migrators can be found [here](nmdc_schema/migrators/README.md).

nmdc_schema/migrators/README.md

Lines changed: 96 additions & 2 deletions
@@ -9,6 +9,19 @@ databases between schemas.

In this document, I'll refer to those Python classes as "migrators."

## Table Of Contents

- [Contents](#contents)
- [Creating a migrator](#creating-a-migrator)
- [Adding Migration Reporting](#adding-migration-reporting)
- [Adding Transaction Support](#adding-transaction-support)
- [Testing the migrator](#testing-the-migrator)
  * [Summary of steps to test a migrator with a local copy of the MongoDB database](#summary-of-steps-to-test-a-migrator-with-a-local-copy-of-the-mongodb-database)
  * [Running a migrator with Docker step-by-step](#running-a-migrator-with-docker-step-by-step)
  * [Summary of steps to test a migrator with the API](#summary-of-steps-to-test-a-migrator-with-the-api)
  * [Running a migrator with project.Makefile step-by-step](#running-a-migrator-with-projectmakefile-step-by-step)

## Contents

This directory contains the following things:
@@ -241,14 +254,16 @@ To add MongoDB transaction support with commit/rollback functionality to your mi

## Testing the migrator

There are two documented ways to test a migrator against a copy of the database. One loads a database dump into a containerized MongoDB server and runs the migrator against that server; the other uses the Runtime API, via `project.Makefile`, to download the collections of interest. Either way is valid, but you should understand what each approach does to ensure you are testing properly.

### Summary of steps to test a migrator with a local copy of the MongoDB database:

1. Create a local copy of the MongoDB database with a schema that conforms to the release from which you are migrating.
2. Check that the database has been loaded correctly (see the sketch below).
3. Run the migrator against the test database.
4. Run validation checks against the migrated database.
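
For step 2, a quick sanity check is to list the collections in the restored database, as the walkthrough below does from a `mongosh` shell. A minimal sketch, assuming `mongosh` can reach the containerized server on the default host and port (connection flags omitted):

```bash
% mongosh --eval 'db.runCommand("listCollections").cursor.firstBatch'
```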

### Running a migrator with Docker step-by-step:

1. **Set up Docker environment and MongoDB database**
@@ -339,3 +354,82 @@ db.runCommand("listCollections").cursor.firstBatch
# This commits the changes to the database
% make run-migrator MIGRATOR=migrator_from_1_0_0_to_EXAMPLE ACTION=commit
```

### Summary of steps to test a migrator with the API:

1. Run the `make` command to run the doctests and generate a new schema file.
2. Create a local copy of the latest schema release.
3. Run an API request to create a local copy of the collections of interest.
4. Run the migrator against the test database.
5. Run validation checks against the migrated database.

All local files are saved to `local/`.
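
Condensed, those five steps boil down to two `make` invocations. Here is the flow in miniature, using the example migrator name that appears later in this document:

```bash
# Run doctests, validate the schema and example data, and rebuild the schema file
% make squeaky-clean test all

# Download the selected collections via the API, run the migrator, and validate
% make test-migrator-on-database MIGRATOR=migrator_from_11_6_1_to_11_7_0 SELECTED_COLLECTIONS=data_object_set
```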

### Running a migrator with project.Makefile step-by-step:

1. **Run doctests and create a local copy of the schema according to your local instance**

   Each migrator should contain docstring tests. This step is important both to catch syntax errors and to **generate a new schema YAML file** to use in the local database tests; it also validates the schema and the example data in this repo. To run the docstring tests and generate a new schema file, run:

   ```bash
   % make squeaky-clean test all
   ```

2. **Run the test-migrator-on-database command with appropriate params**

   The `test-migrator-on-database` command combines 3 separate commands into one. By taking parameters, it removes the need to edit `project.Makefile` directly each time you test a new migrator.

   The following parameters are available:

   - SELECTED_COLLECTIONS - the collections you want to download (i.e., the collections your migrator changes), separated by spaces. The default is all collections EXCEPT `functional_annotation_agg`.
   - MIGRATOR - the name of the migrator file. DO NOT INCLUDE the `.py` EXTENSION.
   - ENV - whether to gather data from the prod or dev Runtime API environment. The default is prod.

   **For testing partial migrators, you must reference the file that wraps the partials, not the individual partials. All collections modified by any of the partial migrators should be listed in the SELECTED_COLLECTIONS parameter.**

   For example, if I wanted to test `migrator_from_11_6_1_to_11_7_0` and only download the `data_object_set` collection from the production database, I would run:

   ```bash
   % make test-migrator-on-database MIGRATOR=migrator_from_11_6_1_to_11_7_0 SELECTED_COLLECTIONS=data_object_set
   ```

   To download data via the _development_ instance of the NMDC Runtime and run the tests:

   ```bash
   % make test-migrator-on-database MIGRATOR=migrator_from_11_6_1_to_11_7_0 SELECTED_COLLECTIONS=data_object_set ENV=dev
   ```

   To specify multiple collections, separate their names with spaces, quoting the value so `make` receives it as a single variable:

   ```bash
   % make test-migrator-on-database MIGRATOR=migrator_from_11_6_1_to_11_7_0 SELECTED_COLLECTIONS="data_object_set biosample_set" ENV=dev
   ```

   > **NOTE**
   > `% make rdf-clean` will delete locally generated files from the testing process. This can be helpful if a bug was identified and the `make` commands need to be rerun after a change.

   That's it! Errors will be written to `local/mongo_via_api_as_nmdc_database_validation.log`, and an alert will appear in the terminal if any occur.
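
   If the alert appears, the log can be inspected directly; for example:

   ```bash
   % tail -n 20 local/mongo_via_api_as_nmdc_database_validation.log
   ```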

3. **In-depth discussion of test-migrator-on-database**

   As mentioned, the `test-migrator-on-database` command is composed of three commands. Each command can be run separately, outside of `test-migrator-on-database`; this comes in handy when you want to test a change to the migrator but do not want to download the database again (which saves time). A sketch of that workflow follows this list.

   - `% make local/mongo_via_api_as_unvalidated_nmdc_database.yaml SELECTED_COLLECTIONS=`
     * This command creates a local dump of the selected collections and saves it to `local/mongo_via_api_as_unvalidated_nmdc_database.yaml`.
   - `% make local/mongo_via_api_as_nmdc_database_after_migrator.yaml MIGRATOR=`
     * This command runs the migrator on the collections in `local/mongo_via_api_as_unvalidated_nmdc_database.yaml` and saves the result to `local/mongo_via_api_as_nmdc_database_after_migrator.yaml`.
   - `% make local/mongo_via_api_as_nmdc_database_validation.log`
     * This command validates the migrated collections against `nmdc_schema/nmdc_materialized_patterns.yaml` on the current branch. That file will have been recompiled with your schema changes when you ran `make squeaky-clean test all`; it is important to test against your changes, as this will be the newest version of the schema.
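
   A minimal sketch of that workflow, rerunning only the last two targets after editing a migrator and reusing the collections downloaded earlier (depending on how the targets' prerequisites are declared, you may first need to delete the previous output files so `make` rebuilds them):

   ```bash
   # Rerun the migration on the previously downloaded collections
   % make local/mongo_via_api_as_nmdc_database_after_migrator.yaml MIGRATOR=migrator_from_11_6_1_to_11_7_0

   # Revalidate the migrated collections against the recompiled schema
   % make local/mongo_via_api_as_nmdc_database_validation.log
   ```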

   You can also test against the most recently released schema (as opposed to the schema built on your local branch). The steps for doing that are shown below, with a sketch after the list. They involve making local changes to `project.Makefile` around the variable `$(LATEST_TAG_SCHEMA_FILE)`, which contains the path to the most recent schema release file after it is downloaded from GitHub and is used for testing.

   - In the `local/mongo_via_api_as_unvalidated_nmdc_database.yaml` rule, replace the `--schema-source` value with `$(LATEST_TAG_SCHEMA_FILE)`
   - Replace `local/mongo_via_api_as_nmdc_database_after_migrator.yaml: nmdc_schema/nmdc_materialized_patterns.yaml` with `local/mongo_via_api_as_nmdc_database_after_migrator.yaml: $(LATEST_TAG_SCHEMA_FILE)`
   - Replace `local/mongo_via_api_as_nmdc_database_validation.log: nmdc_schema/nmdc_materialized_patterns.yaml` with `local/mongo_via_api_as_nmdc_database_validation.log: $(LATEST_TAG_SCHEMA_FILE)`
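
   The last two replacements amount to swapping each target's prerequisite. A sketch only, since the full recipes in `project.Makefile` are not shown here:

   ```makefile
   # Local-only edits (do not commit): prerequisites switched from the
   # branch-built schema to the latest released schema file.
   local/mongo_via_api_as_nmdc_database_after_migrator.yaml: $(LATEST_TAG_SCHEMA_FILE)
   local/mongo_via_api_as_nmdc_database_validation.log: $(LATEST_TAG_SCHEMA_FILE)
   ```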

   > Remember not to commit these local changes, as doing so will interfere with others' testing processes.

nmdc_schema/migrators/migrator_from_11_10_0_to_11_11_0.py

Lines changed: 95 additions & 30 deletions
@@ -9,41 +9,106 @@ class Migrator(MigratorBase):
 
     def upgrade(self) -> None:
         r"""Migrates the database from conforming to the original schema, to conforming to the new schema."""
+        self.adapter.do_for_each_document("biosample_set", self.check_for_fields)
 
-        self.adapter.do_for_each_document(
-            "data_object_set", [self.confirm_permissible_values_are_absent]
-        )
-
-    def confirm_permissible_values_are_absent(self, data_object: dict) -> dict:
+    def check_for_fields(self, biosample: dict) -> dict:
         r"""
-        If the data object has the data_object_type of "Metagenome Bins" or "Centrifuge Classification Report" raise an exception.
+        Check each biosample record to ensure none of the removed slots are being used.
+        List of the slots that were removed:
+        - dna_absorb1
+        - dna_absorb2
+        - dna_collect_site
+        - dna_concentration
+        - dna_cont_type
+        - dna_cont_well
+        - dna_container_id
+        - dna_dnase
+        - dna_isolate_meth
+        - dna_organisms
+        - dna_project_contact
+        - dna_samp_id
+        - dna_sample_format
+        - dna_sample_name
+        - dna_seq_project
+        - dna_seq_project_pi
+        - dna_seq_project_name
+        - dna_volume
+        - proposal_dna
+        - dnase_rna
+        - proposal_rna
+        - rna_absorb1
+        - rna_absorb2
+        - rna_collect_site
+        - rna_concentration
+        - rna_cont_type
+        - rna_cont_well
+        - rna_container_id
+        - rna_isolate_meth
+        - rna_organisms
+        - rna_project_contact
+        - rna_samp_id
+        - rna_sample_format
+        - rna_sample_name
+        - rna_seq_project
+        - rna_seq_project_pi
+        - rna_seq_project_name
+        - rna_volume
+        - collection_date_inc
 
         >>> m = Migrator()
-
-        # Test: data_object_type of "Metagenome Bins"
-        >>> m.confirm_permissible_values_are_absent({"id": 1, "type": "nmdc:DataObject", "data_object_type": "Metagenome Bins"})
+        >>> m.check_for_fields({"id":123, "type": "nmdc:Biosample", "dna_absorb1": "value"})
         Traceback (most recent call last):
-        ...
-        ValueError: DataObject 1 has value: Metagenome Bins
-
-        # Test: data_object_type of "Centrifuge Classification Report"
-        >>> m.confirm_permissible_values_are_absent({"id": 2, "type": "nmdc:DataObject", "data_object_type": "Centrifuge Classification Report"})
+        ...
+        Exception: Field `dna_absorb1` present in biosample 123.
+        >>> m.check_for_fields({"id":123, "type": "nmdc:Biosample", "dna_absorb1": ""})
         Traceback (most recent call last):
-        ...
-        ValueError: DataObject 2 has value: Centrifuge Classification Report
-
-        # Test: valid data_object_type
-        >>> m.confirm_permissible_values_are_absent({"id": 3, "type": "nmdc:DataObject", "data_object_type": "Virus Summary"})
-        {'id': 3, 'type': 'nmdc:DataObject', 'data_object_type': 'Virus Summary'}
+        ...
+        Exception: Field `dna_absorb1` present in biosample 123.
+        >>> m.check_for_fields({"id":123, "type": "nmdc:Biosample"})
         """
+        id = biosample.get("id")
+        removed_slots = [
+            "dna_absorb1",
+            "dna_absorb2",
+            "dna_collect_site",
+            "dna_concentration",
+            "dna_cont_type",
+            "dna_cont_well",
+            "dna_container_id",
+            "dna_dnase",
+            "dna_isolate_meth",
+            "dna_organisms",
+            "dna_project_contact",
+            "dna_samp_id",
+            "dna_sample_format",
+            "dna_sample_name",
+            "dna_seq_project",
+            "dna_seq_project_pi",
+            "dna_seq_project_name",
+            "dna_volume",
+            "proposal_dna",
+            "dnase_rna",
+            "proposal_rna",
+            "rna_absorb1",
+            "rna_absorb2",
+            "rna_collect_site",
+            "rna_concentration",
+            "rna_cont_type",
+            "rna_cont_well",
+            "rna_container_id",
+            "rna_isolate_meth",
+            "rna_organisms",
+            "rna_project_contact",
+            "rna_samp_id",
+            "rna_sample_format",
+            "rna_sample_name",
+            "rna_seq_project",
+            "rna_seq_project_pi",
+            "rna_seq_project_name",
+            "rna_volume",
+            "collection_date_inc",
+        ]
+        for slot in removed_slots:
+            if slot in biosample:
+                raise Exception(f"Field `{slot}` present in biosample {id}.")
 
-        data_object_type_value = data_object.get("data_object_type")
-        data_object_id = data_object.get("id")
-        if (
-            data_object_type_value == "Metagenome Bins"
-            or data_object_type_value == "Centrifuge Classification Report"
-        ):
-            raise ValueError(
-                f"DataObject {data_object_id} has value: {data_object_type_value}"
-            )
-        return data_object
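
Each migrator's docstring tests run as part of `make squeaky-clean test all`; a minimal way to exercise just this file's doctests, assuming `pytest` is available in the project's Poetry environment (an assumption, not a documented target):

```bash
# Collect and run the doctests in the migrator module
% poetry run pytest --doctest-modules nmdc_schema/migrators/migrator_from_11_10_0_to_11_11_0.py
```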
