Skip to content

Conversation

@rsirvent
Copy link

@rsirvent rsirvent commented Sep 25, 2025

Fixes issues: #2 #3 #34 #36

  • Also fixed: if the RO-Crate has several publishers, map only the first one (Zenodo only accpets 1 publisher).
  • Unit tests added.

A COMPSs example with all new features tested directly at Zenodo can be found here: https://zenodo.org/records/17259350

@rsirvent rsirvent requested a review from elichad September 25, 2025 11:11
@rsirvent rsirvent self-assigned this Sep 25, 2025
@elichad
Copy link
Collaborator

elichad commented Sep 30, 2025

Thanks @rsirvent! I was on leave last week and I am at a conference this week, so I'm a bit slow to respond. I'll get to this when I can (probably next week).

@rsirvent
Copy link
Author

rsirvent commented Oct 1, 2025

Thanks @elichad, let me know if you need any clarification.

Once we solve this PR, do you think you could do a new release of the library? It will help me a lot to be able to recommend the installation of a specific version of the package to publish COMPSs experiments as Zenodo records.

@rsirvent
Copy link
Author

rsirvent commented Oct 6, 2025

Sorry for the final minute updates, I wanted to add the support to have a mixed list of authors described as strings or dicts. I'm done now, I won't do any more commits for this branch.

Copy link
Collaborator

@elichad elichad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rsirvent , sorry I took a while to review again. I've added a few comments.

I'll be happy to make a new release once this PR is done :)

- Therefore, we assume the type to be `dataset`
- RO-Crate does not have a field that describes the type of the entire directory
- Therefore, we assume the type to be `dataset` by default
- Only if the 'mainEntity' includes the type 'ComputationalWorkflow', DataCite type is set to 'workflow'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A complication: A Workflow Run Crate might also have this mainEntity, but it's less obvious if that should be a "workflow" or a "dataset" type in Datacite (I would lean toward "dataset" in that case since the focus of a WRROC might be more on the outputs)

Maybe it's better to check if the crate conforms to Workflow RO-Crate only? But harder to do with the existing mapping structure

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the issue. Could you provide maybe an example?

WRROC's Workflow Run Crate inherits requirements from Workflow RO-Crate, which states:

Main Workflow
The Crate MUST contain a data entity of type ["File", "SoftwareSourceCode", "ComputationalWorkflow"] as the Main Workflow.

The Crate MUST refer to the Main Workflow via mainEntity.

So, both must have a 'ComputationalWorkflow' as 'mainEntity'.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, https://doi.org/10.5281/zenodo.12987289 is a WRROC, but the focus (for humans) is on the data outputs - I don't think "workflow" would be an appropriate type for this record.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, if the mainEntity of the RDE is a ComputationalWorkflow, it semantically means that the package is mainly a workflow. So even in that case you show (where it is both a Dataset and a ComputationalWorkflow) I would select the type as a workflow to create awareness that the option exists.

Anyway, for sure the mainEntity can have many different types, and at the end it is a subjective choice of the user what to select. But I think it is a good choice to select by default the Workflow type for the record, and then let the user manually correct it if that is not the case.

Do you have something else in mind? Maybe include more restrictions to select the Workflow type for the record? I'm afraid if we impose many restrictions, then the Workflow type will rarely be selected.

Comment on lines +129 to +130
new_name["family_name"] = parts[-1] if len(parts) > 0 else ""
new_name["given_name"] = " ".join(parts[:-1]) if len(parts) > 1 else ""
Copy link
Collaborator

@elichad elichad Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this approach to setting the names (see ResearchObject/ro-crate#496 which I have been thinking about for a while because of this issue)

e.g. in this case "José María Fernández González" (family name "Fernández González") would not be split correctly. And "Chadwick, Eli" wouldn't follow the right format.

My (unsatisfying) suggestion would be to put the whole name into givenName and have an empty familyName, if possible (or vice versa).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, I don't think this is a challenge that we can solve in this package alone, I think we need to change how we approach names in RO-Crate generally.

I did have the idea at some point to do an initial check through the authors and inform the user if the givenName/familyName weren't set, so they could have the option to set them manually before continuing

Copy link
Author

@rsirvent rsirvent Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the previous mapping was just doing the opposite to what you propose, since familyName is mandatory in Zenodo, so the whole string went to familyName and it was ugly anyway. As a Spaniard with two last names I'm aware that this may fail for some cases, but generally speaking I think it is a better approach than what was before (to put everything as family name). Besides, in the end, the record can be later fixed by hand.

In any case, as you may have seen, if the RO-Crate already includes 'familyName' and 'givenName' they are respected and not replaced with this nameProcessing function result.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, on reflection I agree with you. I'm happy with this code then, and I've made issue #47 for making it clear to users that this process is happening. (not expecting you to address that in this PR, but you can if you like)

if the RO-Crate already includes 'familyName' and 'givenName' they are respected and not replaced with this nameProcessing function result.

Unfortunately I don't think many real-world crates include those terms at the moment - so I've proposed in ResearchObject/ro-crate#496 that we make this a more explicit best practice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could improve now the explanation in the all-mappings.md description by now? And in the future we can work on issue #47 . Printing a warning is very easy, but making the process more interactive will need more work.

for path in paths:
print(f"PATH: {path}")
for i, path in enumerate(paths):
if mapping_key.startswith("publisher_mapping") and i > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a test for this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer me to add a new integration test? Or just modify one of the existing ones to have several publishers?????

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean a unit test, similar to these other ones for the publisher mapping: https://github.com/ResearchObject/ro-crate-inveniordm/blob/main/test/unit/test_publisher.py

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I don't think I can check this with these kind of unit tests, since they check that the application of a mapping is correct. In that code, what really happens is that we avoid to apply the rule if it has been applied previously for a previous 'path'.

I'm still open to add two publishers to an integration test and check that only one is mapped to the DataCite output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants