Add 'workflow' type mapping and several other fixes #45
@@ -7,19 +7,20 @@ Note that RO-Crate and DataCite each contain features that the other does not ha

  ## Mapping of resource type

  - `resource_type` is a mandatory field in DataCite
- - RO-Crate does not have a field that describes the type of the entire directory
- - Therefore, we assume the type to be `dataset`
+ - RO-Crate does not have a field that describes the type of the entire directory
+ - Therefore, we assume the type to be `dataset` by default
+ - Only if the 'mainEntity' includes the type 'ComputationalWorkflow', the DataCite type is set to 'workflow'
Review discussion on this change:

- A complication: a Workflow Run Crate might also have this. Maybe it's better to check if the crate conforms to Workflow RO-Crate only? But that is harder to do with the existing mapping structure.
- I don't understand the issue, could you maybe provide an example? WRROC's Workflow Run Crate inherits requirements from Workflow RO-Crate, which states: […] So, both must have a 'ComputationalWorkflow' as 'mainEntity'.
- For example, https://doi.org/10.5281/zenodo.12987289 is a WRROC, but the focus (for humans) is on the data outputs. I don't think "workflow" would be an appropriate type for this record.
- To me, if the mainEntity of the RDE is a ComputationalWorkflow, it semantically means that the package is mainly a workflow. So even in the case you show (where it is both a Dataset and a ComputationalWorkflow), I would select the workflow type to create awareness that the option exists. In any case, the mainEntity can have many different types, and in the end it is a subjective choice of the user what to select. But I think it is a good choice to select the workflow type for the record by default, and then let the user manually correct it if that is not the case. Do you have something else in mind? Maybe include more restrictions before selecting the workflow type? I'm afraid that if we impose many restrictions, the workflow type will rarely be selected.
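For illustration, here is a minimal sketch of the rule described above (the function name and the `"./"` root-entity lookup are simplifying assumptions for this example, not the PR's actual code):

```python
def guess_resource_type(rc: dict) -> str:
    """Return 'workflow' if the crate's mainEntity is a ComputationalWorkflow,
    otherwise fall back to the default 'dataset'."""
    entities = {e.get("@id"): e for e in rc.get("@graph", [])}
    # Assumes the root data entity uses the conventional "./" identifier.
    root = entities.get("./", {})
    main_ref = root.get("mainEntity")
    main_id = main_ref.get("@id") if isinstance(main_ref, dict) else main_ref
    types = entities.get(main_id, {}).get("@type", [])
    if isinstance(types, str):  # @type may be a plain string or a list
        types = [types]
    return "workflow" if "ComputationalWorkflow" in types else "dataset"
```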

  ## Mapping of creators

- - an `author` in RO-Crate is mapped to a `creator` in DataCite, alongside with their affiliations
+ - an `author` or a `creator` in RO-Crate is mapped to `creators` in DataCite, along with their affiliations
  - if the `@id` field of an author is an ORCiD, the ORCiD field is parsed and added in DataCite
  - consists of `person or organization` and `affiliation`
  - if no creator exists, the creator is chosen to be the value `:unkn`
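Parsing an ORCID iD out of an author's `@id` typically amounts to matching the orcid.org URL pattern; a hypothetical helper (not the project's code) might look like this:

```python
import re

# Matches identifiers of the form https://orcid.org/0000-0002-1825-0097.
ORCID_PATTERN = re.compile(r"^https?://orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")

def extract_orcid(author_id: str) -> str | None:
    """Return the bare ORCID iD from an author's @id, or None if it is not an ORCID URL."""
    match = ORCID_PATTERN.match(author_id or "")
    return match.group(1) if match else None
```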

  ## Mapping of contributors

- - similar to creator mapping
+ - similar to the creator mapping, but `contributor` is only mapped to `contributors` in DataCite if contributors have been defined (since `contributor` is a valid schema.org term but is not mandatory in RO-Crate)

  ## Mapping of title

@@ -50,9 +51,9 @@ Note that RO-Crate and DataCite each contain features that the other does not ha

  ## Mapping of rights/licenses

  - the `identifier` field in DataCite is not mapped; since it defaults to SPDX, this would require knowledge of the mapping of a licence URL to the SPDX id (https://spdx.org/licenses/)
  - in case the RO-Crate does not reference another object, but contains a direct value, the following is applied
-   - if the value is a URL: only set the link value in the DataCite file
+   - if the URL is an SPDX URL, the 'id', 'scheme' and 'title' fields are automatically generated from the URL
    - if the value is freetext: only set the description value in the DataCite file
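The SPDX case can be pictured with a small sketch (field and function names are assumptions for illustration, not necessarily what the converter uses):

```python
def rights_from_url(url: str) -> dict:
    """Build a rights entry from a licence URL; id/scheme/title are only filled
    for SPDX URLs, otherwise only the link is set."""
    rights = {"link": url}
    spdx_prefix = "https://spdx.org/licenses/"
    if url.startswith(spdx_prefix):
        spdx_id = url[len(spdx_prefix):].removesuffix(".html").rstrip("/")
        rights["id"] = spdx_id.lower()      # e.g. "cc-by-4.0" (assumed lowercase id convention)
        rights["scheme"] = "spdx"
        rights["title"] = {"en": spdx_id}   # e.g. {"en": "CC-BY-4.0"}
    return rights

# Example: rights_from_url("https://spdx.org/licenses/CC-BY-4.0")
# -> {'link': ..., 'id': 'cc-by-4.0', 'scheme': 'spdx', 'title': {'en': 'CC-BY-4.0'}}
```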
  ## Mapping of subjects
@@ -51,6 +51,8 @@ def convert(rc: dict, metadata_only: bool = False) -> dict:
      :return: Dictionary containing DataCite metadata
      """

+     rc = merge_authors_and_creators(rc)
+
      m = load_mapping_json()

      dc = setup_dc()
@@ -87,7 +89,7 @@ def convert(rc: dict, metadata_only: bool = False) -> dict:
          print(f"\t|- Applying mapping {mapping_key}")

          mapping = mappings.get(mapping_key)
-         dc, any_present = apply_mapping(mapping, mapping_paths, rc, dc)
+         dc, any_present = apply_mapping(mapping, mapping_paths, rc, dc, mapping_key)
          is_any_present = is_any_present or any_present

      if not is_any_present:
@@ -135,7 +137,7 @@ def get_mapping_paths(rc: dict, mappings: dict) -> dict:
      return mapping_paths


- def apply_mapping(mapping, mapping_paths, rc, dc):  # noqa: C901
+ def apply_mapping(mapping, mapping_paths, rc, dc, mapping_key):  # noqa: C901
      """Convert RO-Crate metadata to DataCite according to the specified mapping and
      paths.
@@ -152,6 +154,7 @@ def apply_mapping(mapping, mapping_paths, rc, dc):  # noqa: C901
      :param mapping_paths: A list of paths, used to disambiguate array values
      :param rc: Dictionary of RO-Crate metadata
      :param dc: Dictionary of DataCite metadata
+     :param mapping_key: The key of the mapping being applied
      :return: tuple containing the updated dictionary of DataCite metadata, and a boolean
          indicating whether the rule was applied
      """
@@ -179,8 +182,15 @@ def apply_mapping(mapping, mapping_paths, rc, dc):  # noqa: C901
          paths = mapping_paths.get(processed_string)
          print(f"\t\t|- Paths: {paths}")

-         for path in paths:
-             print(f"PATH: {path}")
+         for i, path in enumerate(paths):
+             if mapping_key.startswith("publisher_mapping") and i > 0:
+                 # RO-Crate can have a list of publishers, but DataCite only supports one
+                 # publisher. So, we only apply the first one.
+                 print(
+                     f"\t\t|- Skipping path {i} for mapping {mapping_key} to avoid "
+                     "overwriting previous values."
+                 )
+                 continue
              new_path = path.copy()
              from_value = get_value_from_rc(rc.copy(), from_mapping_value, new_path)

Review discussion on the publisher skip:

- Could you add a test for this case?
- Would you prefer me to add a new integration test, or just modify one of the existing ones to have several publishers?
- I mean a unit test, similar to these other ones for the publisher mapping: https://github.com/ResearchObject/ro-crate-inveniordm/blob/main/test/unit/test_publisher.py
- Hmmm, I don't think I can check this with that kind of unit test, since they check that the application of a single mapping is correct. What really happens in this code is that we avoid applying the rule if it has already been applied for a previous 'path'. I'm still open to adding two publishers to an integration test and checking that only one is mapped to the DataCite output.
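Following up on that discussion, an integration-style check with two publishers might look roughly like the sketch below; the import path, the crate contents, and the exact shape of the DataCite output are assumptions for illustration, not code from this PR:

```python
# Hypothetical integration-style test: a crate lists two publishers; only the
# first should survive the conversion, since DataCite holds a single publisher.
from rocrate_inveniordm.mapping.converter import convert  # assumed module path

def test_only_first_publisher_is_mapped():
    rc = {
        "@graph": [
            {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
            {
                "@id": "./",
                "@type": "Dataset",
                "name": "Crate with two publishers",
                "publisher": [{"@id": "#org-a"}, {"@id": "#org-b"}],
            },
            {"@id": "#org-a", "@type": "Organization", "name": "Org A"},
            {"@id": "#org-b", "@type": "Organization", "name": "Org B"},
        ]
    }
    dc = convert(rc, metadata_only=True)
    # Only the first publisher should be written; later paths are skipped.
    assert dc["metadata"]["publisher"] == "Org A"
```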
@@ -189,12 +199,12 @@ def apply_mapping(mapping, mapping_paths, rc, dc):  # noqa: C901
                  # must be implemented on how to handle it)
                  print(
                      "\t\t|- Result is a JSON object, so this rule cannot be applied. "
-                     "Skipping to next rule."
+                     "Skipping to next path."
                  )
                  from_value = None

-             # if (from_value is None):
-             #     continue
+             if from_value is None:
+                 continue

              if only_if_value is not None:
                  print(f"\t\t|- Checking condition {only_if_value}")
@@ -213,8 +223,8 @@ def apply_mapping(mapping, mapping_paths, rc, dc):  # noqa: C901
                      f"{path.copy()}"
                  )
                  rule_applied = True
-                 print(dc, to_mapping_value, from_value)
                  dc = set_dc(dc, to_mapping_value, from_value, path.copy())
+                 print(dc)

      return dc, rule_applied
@@ -363,8 +373,13 @@ def set_dc(dictionary, key, value=None, path=[]):
                  path = path[1:]
                  last_val = current_dict[key_part[:-2]]

-                 if len(current_dict[key_part[:-2]]) <= index:
-                     current_dict[key_part[:-2]].append({})
+                 while len(current_dict[key_part[:-2]]) <= index:
+                     current_dict[key_part[:-2]].append(
+                         {}
+                     )  # It expands 1 by 1 anyway, since no empty paths can remain after
+                     # a mapping rule is applied

                  # print(f"INDEX: {index}, len of key: {len(current_dict[key_part[:-2]])}")

                  current_dict = current_dict[key_part[:-2]][index]
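As a toy illustration of the padding behaviour this change introduces (plain Python, unrelated to the project's data): the list is grown with empty dicts until the target index exists, then the value is written at that index.

```python
values = []
index = 2

# Grow the list one element at a time until values[index] is addressable.
while len(values) <= index:
    values.append({})

values[index]["name"] = "third entry"
print(values)  # [{}, {}, {'name': 'third entry'}]
```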
| 
          
            
          
           | 
    @@ -425,5 +440,31 @@ def process(process_rule, value): | |
| return function(value) | ||
| 
     | 
||
| 
     | 
||
| def merge_authors_and_creators(rc: dict): | ||
                
      
                  rsirvent marked this conversation as resolved.
               
          
            Show resolved
            Hide resolved
         | 
+     """
+     Copy creators to authors in the RO-Crate, so they can be processed in a single
+     mapping. Mapping from 'author' to 'creators' and later from 'creator' to
+     'creators' would overwrite previously written values.
+     """
+
+     for rde in rc["@graph"]:
+         if "creator" in rde:
+             for person_or_org in rde["creator"]:
+                 # Creators given as plain strings are appended if not already listed.
+                 if isinstance(person_or_org, str):
+                     added_authors = [item for item in rde["author"]]
+                     if person_or_org not in added_authors:
+                         rde["author"].append(person_or_org)
+                     continue
+                 # Creators given as objects are matched against the @ids already
+                 # present among the authors (typically ORCID URLs).
+                 urls_orcid = [
+                     item["@id"]
+                     for item in rde["author"]
+                     if isinstance(item, dict) and "@id" in item
+                 ]
+                 if person_or_org["@id"] not in urls_orcid:
+                     rde["author"].append(person_or_org)
+
+     return rc
+
+
  if __name__ == "__main__":
      main()
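To make the merge step concrete, here is a small worked example, assuming `merge_authors_and_creators` from the diff above is in scope; the entity values are made up for illustration:

```python
# A root data entity with one author and three creators: one duplicates the
# existing author (skipped), one is a new organisation, one is a plain string.
rc = {
    "@graph": [
        {
            "@id": "./",
            "author": [{"@id": "https://orcid.org/0000-0002-1825-0097"}],
            "creator": [
                {"@id": "https://orcid.org/0000-0002-1825-0097"},  # duplicate, skipped
                {"@id": "https://example.org/org-x"},              # new, appended
                "Jane Doe",                                        # plain string, appended
            ],
        }
    ]
}

merged = merge_authors_and_creators(rc)
print(merged["@graph"][0]["author"])
# [{'@id': 'https://orcid.org/0000-0002-1825-0097'}, {'@id': 'https://example.org/org-x'}, 'Jane Doe']
```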