-
Notifications
You must be signed in to change notification settings - Fork 3
Add 'workflow' type mapping and several other fixes #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…reator' and 'contributor'
… Other fixes in name assignations
…uthors as contributors
…DataCite out considering contributors. Solved bug at format_value when arrays where used combined with @@this
…do upload does not fail. Avoid the editor mapping
|
Thanks @rsirvent! I was on leave last week and I am at a conference this week, so I'm a bit slow to respond. I'll get to this when I can (probably next week). |
|
Thanks @elichad, let me know if you need any clarification. Once we solve this PR, do you think you could do a new release of the library? It will help me a lot to be able to recommend the installation of a specific version of the package to publish COMPSs experiments as Zenodo records. |
…ngs or dicts. Still missing to fix merge_authors_and_creators
|
Sorry for the final minute updates, I wanted to add the support to have a mixed list of authors described as strings or dicts. I'm done now, I won't do any more commits for this branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rsirvent , sorry I took a while to review again. I've added a few comments.
I'll be happy to make a new release once this PR is done :)
| - Therefore, we assume the type to be `dataset` | ||
| - RO-Crate does not have a field that describes the type of the entire directory | ||
| - Therefore, we assume the type to be `dataset` by default | ||
| - Only if the 'mainEntity' includes the type 'ComputationalWorkflow', DataCite type is set to 'workflow' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A complication: A Workflow Run Crate might also have this mainEntity, but it's less obvious if that should be a "workflow" or a "dataset" type in Datacite (I would lean toward "dataset" in that case since the focus of a WRROC might be more on the outputs)
Maybe it's better to check if the crate conforms to Workflow RO-Crate only? But harder to do with the existing mapping structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the issue. Could you provide maybe an example?
WRROC's Workflow Run Crate inherits requirements from Workflow RO-Crate, which states:
Main Workflow
The Crate MUST contain a data entity of type ["File", "SoftwareSourceCode", "ComputationalWorkflow"] as the Main Workflow.
The Crate MUST refer to the Main Workflow via mainEntity.
So, both must have a 'ComputationalWorkflow' as 'mainEntity'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, https://doi.org/10.5281/zenodo.12987289 is a WRROC, but the focus (for humans) is on the data outputs - I don't think "workflow" would be an appropriate type for this record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, if the mainEntity of the RDE is a ComputationalWorkflow, it semantically means that the package is mainly a workflow. So even in that case you show (where it is both a Dataset and a ComputationalWorkflow) I would select the type as a workflow to create awareness that the option exists.
Anyway, for sure the mainEntity can have many different types, and at the end it is a subjective choice of the user what to select. But I think it is a good choice to select by default the Workflow type for the record, and then let the user manually correct it if that is not the case.
Do you have something else in mind? Maybe include more restrictions to select the Workflow type for the record? I'm afraid if we impose many restrictions, then the Workflow type will rarely be selected.
| new_name["family_name"] = parts[-1] if len(parts) > 0 else "" | ||
| new_name["given_name"] = " ".join(parts[:-1]) if len(parts) > 1 else "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this approach to setting the names (see ResearchObject/ro-crate#496 which I have been thinking about for a while because of this issue)
e.g. in this case "José María Fernández González" (family name "Fernández González") would not be split correctly. And "Chadwick, Eli" wouldn't follow the right format.
My (unsatisfying) suggestion would be to put the whole name into givenName and have an empty familyName, if possible (or vice versa).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, I don't think this is a challenge that we can solve in this package alone, I think we need to change how we approach names in RO-Crate generally.
I did have the idea at some point to do an initial check through the authors and inform the user if the givenName/familyName weren't set, so they could have the option to set them manually before continuing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the previous mapping was just doing the opposite to what you propose, since familyName is mandatory in Zenodo, so the whole string went to familyName and it was ugly anyway. As a Spaniard with two last names I'm aware that this may fail for some cases, but generally speaking I think it is a better approach than what was before (to put everything as family name). Besides, in the end, the record can be later fixed by hand.
In any case, as you may have seen, if the RO-Crate already includes 'familyName' and 'givenName' they are respected and not replaced with this nameProcessing function result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, on reflection I agree with you. I'm happy with this code then, and I've made issue #47 for making it clear to users that this process is happening. (not expecting you to address that in this PR, but you can if you like)
if the RO-Crate already includes 'familyName' and 'givenName' they are respected and not replaced with this nameProcessing function result.
Unfortunately I don't think many real-world crates include those terms at the moment - so I've proposed in ResearchObject/ro-crate#496 that we make this a more explicit best practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I could improve now the explanation in the all-mappings.md description by now? And in the future we can work on issue #47 . Printing a warning is very easy, but making the process more interactive will need more work.
| for path in paths: | ||
| print(f"PATH: {path}") | ||
| for i, path in enumerate(paths): | ||
| if mapping_key.startswith("publisher_mapping") and i > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a test for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer me to add a new integration test? Or just modify one of the existing ones to have several publishers?????
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean a unit test, similar to these other ones for the publisher mapping: https://github.com/ResearchObject/ro-crate-inveniordm/blob/main/test/unit/test_publisher.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I don't think I can check this with these kind of unit tests, since they check that the application of a mapping is correct. In that code, what really happens is that we avoid to apply the rule if it has been applied previously for a previous 'path'.
I'm still open to add two publishers to an integration test and check that only one is mapped to the DataCite output.
Remove debugging message. Co-authored-by: Eli Chadwick <eli.chadwick@manchester.ac.uk>
…. Fixed unit tests to detect this
Fixes issues: #2 #3 #34 #36
A COMPSs example with all new features tested directly at Zenodo can be found here: https://zenodo.org/records/17259350