Changes from 26 commits
32 commits
2413b69
Fixes in Zenodo record 'type' (supports 'workflow' now), 'author', 'c…
rsirvent May 22, 2025
786994f
Added 'family_name' to the DataCite json, since it is mandatory in Ze…
rsirvent May 23, 2025
639b4ea
JSON fix, black broke it
rsirvent May 23, 2025
2d0ede0
Merged 'author' and 'creator' at the RO-Crate to do a single mapping.…
rsirvent May 23, 2025
3445658
Fixed license Title specification
rsirvent May 23, 2025
b52d119
RO-Crate can have a list of publishers, but DataCite only accepts one
rsirvent Sep 18, 2025
e728c39
Fixes when an 'author' is an entity but has no 'name' specified
rsirvent Sep 18, 2025
4626f25
Merge branch 'ResearchObject:main' into workflow_type
rsirvent Sep 18, 2025
d35c005
Tested licenses at Zenodo
rsirvent Sep 23, 2025
d6d16cb
Fixed spdx compatibility with Zenodo to detect licenses
rsirvent Sep 23, 2025
3f887e4
Fixed ROR mapping, and added unit tests
rsirvent Sep 25, 2025
88b3dc3
Style fixes and test_publisher.py fixes
rsirvent Sep 25, 2025
bef9b68
Line limit fix
rsirvent Sep 25, 2025
bb130b9
Misplaced ifNonePresent. Fixes when type is not a list
rsirvent Sep 26, 2025
f88d7bf
Fixes in integration test result, including ror and avoiding to add a…
rsirvent Sep 26, 2025
0fcbf1c
Fixes related to real-world-example test: fixed 'contributor', fixed …
rsirvent Sep 26, 2025
2cd44f2
Clean up test results
rsirvent Sep 26, 2025
2be4dfe
flake8 fixes
rsirvent Sep 26, 2025
96a33b5
Fixes on utf-8-csv-crate test. Provide real orcid and ror so the Zeno…
rsirvent Sep 30, 2025
8526384
Allow mapping arrays in RO-Crate's 'contentLocation'
rsirvent Sep 30, 2025
d1c3a87
Extra feature: allow mixed list of authors and creators that are stri…
rsirvent Oct 6, 2025
48aebfe
Added ifNonePresent rule for the resource_type property
rsirvent Oct 6, 2025
878aba4
Fixed merge_authors_and_creators to support strings and dicts
rsirvent Oct 6, 2025
f7af1ee
Style fixes
rsirvent Oct 6, 2025
105ccdb
More fixes
rsirvent Oct 6, 2025
068f8ac
Flak38 fix
rsirvent Oct 6, 2025
0ce6aa9
Update src/rocrate_inveniordm/mapping/converter.py
rsirvent Oct 20, 2025
94c2bca
Fixed conflict
rsirvent Oct 20, 2025
972030d
Fixes in rightsProcessing unit test. Added unit tests for merge_autho…
rsirvent Oct 21, 2025
8b6b934
Black fixes
rsirvent Oct 21, 2025
a3e564d
Fixes in rightsProcessing, only id or title is assigned, but not both…
rsirvent Oct 21, 2025
17cc496
Black fixes
rsirvent Oct 21, 2025
11 changes: 6 additions & 5 deletions docs/all-mappings.md
@@ -7,19 +7,20 @@ Note that RO-Crate and DataCite each contain features that the other does not ha
## Mapping of resource type

- `resource_type` is a mandatory field in DataCite
- RO-Crate does not have a field that describes the type of the entire directory
- Therefore, we assume the type to be `dataset`
- RO-Crate does not have a field that describes the type of the entire directory
- Therefore, we assume the type to be `dataset` by default
- Only if the 'mainEntity' includes the type 'ComputationalWorkflow', DataCite type is set to 'workflow'
Collaborator

A complication: a Workflow Run Crate might also have this mainEntity, but it's less obvious whether that should be a "workflow" or a "dataset" type in DataCite (I would lean toward "dataset" in that case, since the focus of a WRROC might be more on the outputs).

Maybe it's better to check whether the crate conforms to Workflow RO-Crate only? But that is harder to do with the existing mapping structure.

Author

I don't understand the issue. Could you maybe provide an example?

WRROC's Workflow Run Crate inherits requirements from Workflow RO-Crate, which states:

Main Workflow
The Crate MUST contain a data entity of type ["File", "SoftwareSourceCode", "ComputationalWorkflow"] as the Main Workflow.

The Crate MUST refer to the Main Workflow via mainEntity.

So, both must have a 'ComputationalWorkflow' as 'mainEntity'.

Collaborator

For example, https://doi.org/10.5281/zenodo.12987289 is a WRROC, but the focus (for humans) is on the data outputs - I don't think "workflow" would be an appropriate type for this record.

Author

To me, if the mainEntity of the RDE is a ComputationalWorkflow, it semantically means that the package is mainly a workflow. So even in the case you show (where it is both a Dataset and a ComputationalWorkflow), I would select the workflow type to create awareness that the option exists.

Anyway, the mainEntity can certainly have many different types, and in the end it is a subjective choice of the user what to select. But I think it is a good choice to select the Workflow type for the record by default, and then let the user manually correct it if that is not the case.

Do you have something else in mind? Maybe include more restrictions to select the Workflow type for the record? I'm afraid that if we impose many restrictions, the Workflow type will rarely be selected.
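
To make the two positions in this thread concrete, here is a minimal sketch of both heuristics side by side. It is not the PR's implementation (the PR expresses the rule through the mapping JSON); the function names, the `exclude_run_crates` flag, and the profile prefix constant are assumptions introduced for illustration only.

```python
WORKFLOW_TYPE = "ComputationalWorkflow"
# Assumption: Workflow Run RO-Crate profile identifiers share this prefix.
WFRUN_PROFILE_PREFIX = "https://w3id.org/ro/wfrun/"


def as_list(value):
    """Normalise a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_entity(rc, entity_id):
    """Return the entity with the given @id from the crate's @graph, or None."""
    for entity in rc.get("@graph", []):
        if entity.get("@id") == entity_id:
            return entity
    return None


def guess_resource_type(rc, root_id="./", exclude_run_crates=False):
    """Hypothetical helper: 'workflow' if mainEntity is a ComputationalWorkflow.

    With exclude_run_crates=True, crates that declare a Workflow Run profile
    on the root entity fall back to 'dataset' (the reviewer's preference).
    """
    root = find_entity(rc, root_id) or {}
    main_ref = root.get("mainEntity")
    main_id = main_ref.get("@id") if isinstance(main_ref, dict) else None
    main = find_entity(rc, main_id) if main_id else None
    if main is None or WORKFLOW_TYPE not in as_list(main.get("@type")):
        return "dataset"
    if exclude_run_crates:
        profiles = [
            p.get("@id", "")
            for p in as_list(root.get("conformsTo"))
            if isinstance(p, dict)
        ]
        if any(p.startswith(WFRUN_PROFILE_PREFIX) for p in profiles):
            return "dataset"
    return "workflow"
```

With the default arguments this follows the author's proposal (any `ComputationalWorkflow` main entity yields `workflow`); passing `exclude_run_crates=True` approximates the stricter check suggested by the reviewer.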


## Mapping of creators

- an `author` in RO-Crate is mapped to a `creator` in DataCite, alongside with their affiliations
- an `author` or a `creator` in RO-Crate is mapped to `creators` in DataCite, alongside with their affiliations
- if the `@id` field of an author is an ORCiD, the ORCiD field is parsed and added in DataCite
- consists of `person or organization` and `affiliation`
- if no creator exists, the creator is chosen to be the value `:unkn`

## Mapping of contributors

- similar to creator mapping
- similar to creator mapping, but only `contributor` is mapped to `contributors` in DataCite if they have been defined (since they are a valid schema.org term but not mandatory in RO-Crate)

## Mapping of title

@@ -50,9 +51,9 @@ Note that RO-Crate and DataCite each contain features that the other does not ha

## Mapping of rights/licenses

- the `identifier` field in DataCite is not mapped, since it defaults to SPDX this would require knowlege of the mapping of a licence URL to the SPDX id (https://spdx.org/licenses/)
- in case the RO-Crate does not reference another object, but contains a direct value the following is applied
- if the value is a URL: only set the link value in the DataCite file
- if the URL is an SPDX URL, the 'id', 'scheme' and 'title' fields are automatically generated from the URL
- if the value is freetext: only set the description value in the DataCite file
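
The three cases in this list could be sketched as follows. This is an illustration under stated assumptions, not the converter's actual code: the function name is hypothetical, the output keys simply reuse the field names from the bullets, and the way `id` and `title` are derived from the SPDX URL (last path segment, lower-cased for `id`) is a guess.

```python
from urllib.parse import urlparse

SPDX_PREFIX = "https://spdx.org/licenses/"


def classify_license_value(value: str) -> dict:
    """Hypothetical helper mirroring the bullet list for direct license values."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        # Free text: only the description is set.
        return {"description": value}
    if value.startswith(SPDX_PREFIX):
        spdx_id = value[len(SPDX_PREFIX):].rstrip("/")
        # Assumption: id is the lower-cased SPDX identifier and the
        # identifier itself serves as the title.
        return {
            "link": value,
            "id": spdx_id.lower(),
            "scheme": "spdx",
            "title": spdx_id,
        }
    # Any other URL: only the link is set.
    return {"link": value}
```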

## Mapping of subjects
4 changes: 4 additions & 0 deletions src/rocrate_inveniordm/mapping/condition_functions.py
@@ -49,3 +49,7 @@ def embargoed(value):

def string(value):
    return value and isinstance(value, str)


def ror(value):
    return value and value.startswith("https://ror.org/")
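
One thing worth noting about `ror` as written: it calls `startswith` on any truthy value, so a truthy non-string (for example a dict such as `{"@id": "..."}`) would raise `AttributeError`. A defensive variant might look like the sketch below; the name `ror_safe` is hypothetical and not part of the PR. Unlike the original, it always returns a `bool` rather than passing a falsy input through.

```python
ROR_PREFIX = "https://ror.org/"


def ror_safe(value):
    """Return True only for strings that start with the ROR URL prefix."""
    return isinstance(value, str) and value.startswith(ROR_PREFIX)
```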
61 changes: 51 additions & 10 deletions src/rocrate_inveniordm/mapping/converter.py
@@ -51,6 +51,8 @@ def convert(rc: dict, metadata_only: bool = False) -> dict:
:return: Dictionary containing DataCite metadata
"""

rc = merge_authors_and_creators(rc)

m = load_mapping_json()

dc = setup_dc()
@@ -87,7 +89,7 @@ def convert(rc: dict, metadata_only: bool = False) -> dict:
print(f"\t|- Applying mapping {mapping_key}")

mapping = mappings.get(mapping_key)
dc, any_present = apply_mapping(mapping, mapping_paths, rc, dc)
dc, any_present = apply_mapping(mapping, mapping_paths, rc, dc, mapping_key)
is_any_present = is_any_present or any_present

if not is_any_present:
@@ -135,7 +137,7 @@ def get_mapping_paths(rc: dict, mappings: dict) -> dict:
return mapping_paths


def apply_mapping(mapping, mapping_paths, rc, dc): # noqa: C901
def apply_mapping(mapping, mapping_paths, rc, dc, mapping_key): # noqa: C901
"""Convert RO-Crate metadata to DataCite according to the specified mapping and
paths.

@@ -152,6 +154,7 @@ def apply_mapping(mapping, mapping_paths, rc, dc): # noqa: C901
:param mapping_paths: A list of paths, used to disambiguate array values
:param rc: Dictionary of RO-Crate metadata
:param dc: Dictionary of DataCite metadata
:param mapping_key: The key of the mapping being applied
:return: tuple containing the updated dictionary of DataCite metadata, and a boolean
indicating whether the rule was applied
"""
@@ -179,8 +182,15 @@ def apply_mapping(mapping, mapping_paths, rc, dc): # noqa: C901
paths = mapping_paths.get(processed_string)
print(f"\t\t|- Paths: {paths}")

for path in paths:
print(f"PATH: {path}")
for i, path in enumerate(paths):
if mapping_key.startswith("publisher_mapping") and i > 0:
Collaborator

Could you add a test for this case?

Author

Would you prefer that I add a new integration test, or just modify one of the existing ones to have several publishers?

Collaborator

I mean a unit test, similar to these other ones for the publisher mapping: https://github.com/ResearchObject/ro-crate-inveniordm/blob/main/test/unit/test_publisher.py

Author

Hmmm, I don't think I can check this with this kind of unit test, since those tests check that the application of a mapping is correct. What really happens in that code is that we avoid applying the rule if it has already been applied for a previous 'path'.

I'm still open to adding two publishers to an integration test and checking that only one is mapped to the DataCite output.

# RO-Crate can have a list of publishers, but DataCite only supports one
# publisher. So, we only apply the first one.
print(
f"\t\t|- Skipping path {i} for mapping {mapping_key} to avoid "
"overwriting previous values."
)
continue
new_path = path.copy()
from_value = get_value_from_rc(rc.copy(), from_mapping_value, new_path)

@@ -189,12 +199,12 @@ def apply_mapping(mapping, mapping_paths, rc, dc): # noqa: C901
# must be implemented on how to handle it)
print(
"\t\t|- Result is a JSON object, so this rule cannot be applied. "
"Skipping to next rule."
"Skipping to next path."
)
from_value = None

# if (from_value is None):
# continue
if from_value is None:
continue

if only_if_value is not None:
print(f"\t\t|- Checking condition {only_if_value}")
@@ -213,8 +223,8 @@ def apply_mapping(mapping, mapping_paths, rc, dc): # noqa: C901
f"{path.copy()}"
)
rule_applied = True
print(dc, to_mapping_value, from_value)
dc = set_dc(dc, to_mapping_value, from_value, path.copy())
print(dc)

return dc, rule_applied
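
Regarding the unit-test question raised in the review thread: the PR keeps the publisher skip inline in `apply_mapping`, which is why it is awkward to test in isolation. If the skip were factored into a small helper, it could be exercised directly. The sketch below is only that hypothetical refactoring; the helper name, its default prefix argument, and the example mapping keys in the usage note are assumptions, not code from the repository.

```python
def first_path_only(paths, mapping_key, single_valued_prefix="publisher_mapping"):
    """Yield (index, path) pairs, dropping every path after the first when the
    mapping targets a single-valued DataCite field such as the publisher."""
    for i, path in enumerate(paths):
        if mapping_key.startswith(single_valued_prefix) and i > 0:
            # A later path would overwrite the value set by the first one.
            continue
        yield i, path
```

A test could then assert that a key such as `"publisher_mapping_1"` yields one path from a two-element list while any other key yields both.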

@@ -363,8 +373,13 @@ def set_dc(dictionary, key, value=None, path=[]):
path = path[1:]
last_val = current_dict[key_part[:-2]]

if len(current_dict[key_part[:-2]]) <= index:
current_dict[key_part[:-2]].append({})
while len(current_dict[key_part[:-2]]) <= index:
current_dict[key_part[:-2]].append(
{}
) # It expands 1 by 1 anyway, since no empty paths can remain after
# a mapping rule is applied

# print(f"INDEX: {index}, len of key: {len(current_dict[key_part[:-2]])}")

current_dict = current_dict[key_part[:-2]][index]
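
The switch from `if` to `while` matters whenever the target index is more than one position past the end of the list: a single `append` would leave the list too short and the following index access would raise `IndexError`. A standalone sketch of the padding step, with a hypothetical helper name that does not appear in the PR:

```python
def ensure_index(items, index):
    """Pad `items` with empty dicts until `items[index]` exists, then return it."""
    while len(items) <= index:
        items.append({})
    return items[index]
```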

@@ -425,5 +440,31 @@ def process(process_rule, value):
return function(value)


def merge_authors_and_creators(rc: dict):
    """
    Copy creators to authors in the RO-Crate, so they can be processed in a single
    mapping. Mapping from 'author' to 'creators' and later from 'creator' to
    'creators' causes overwritings.
    """

    for rde in rc["@graph"]:
        if "creator" in rde:
            for person_or_org in rde["creator"]:
                if isinstance(person_or_org, str):
                    added_authors = [item for item in rde["author"]]
                    if person_or_org not in added_authors:
                        rde["author"].append(person_or_org)
                    continue
                urls_orcid = [
                    item["@id"]
                    for item in rde["author"]
                    if isinstance(item, dict) and "@id" in item
                ]
                if person_or_org["@id"] not in urls_orcid:
                    rde["author"].append(person_or_org)

    return rc
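
As shown in this diff, the function assumes that every entity with a `creator` also has an `author` list, that `creator` is itself a list, and that every dict entry carries an `@id`. A more defensive variant is sketched below for comparison. It is a hypothetical rewrite (the `_safe` name and both helpers are introduced here), and it replaces `author` with a new list instead of appending in place.

```python
def _as_list(value):
    """Normalise an absent, single, or list-valued property to a new list."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def _key(item):
    """Identity used for de-duplication: a dict's @id, or the string itself."""
    if isinstance(item, dict):
        return item.get("@id")
    return item


def merge_authors_and_creators_safe(rc):
    """Append creators to authors, skipping entries already present."""
    for entity in rc.get("@graph", []):
        if "creator" not in entity:
            continue
        authors = _as_list(entity.get("author"))
        seen = [_key(a) for a in authors if _key(a) is not None]
        for candidate in _as_list(entity["creator"]):
            key = _key(candidate)
            if key is not None and key in seen:
                continue  # same string or same @id already listed
            authors.append(candidate)
            if key is not None:
                seen.append(key)
        entity["author"] = authors
    return rc
```

Dict entries without an `@id` cannot be compared reliably, so this sketch always keeps them; whether that is the desired behaviour is a design decision left open here.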


if __name__ == "__main__":
main()
2 changes: 1 addition & 1 deletion src/rocrate_inveniordm/mapping/crate_utils.py
@@ -92,7 +92,7 @@ def get_value_from_rc(rc, from_key, path=[]):

print(f"\t\t|- Retrieving value {from_key} with path {path} from RO-Crate.")
keys = from_key.split(".")
print(keys)
# print(keys)
current_entity = rc_get_rde(rc)

for key in keys: