Address RDA Working Group on Dynamic Data Citation (WGDC) recommendations #266

@pitkant

Executive summary:

  • Maybe the pxweb package could include functions for calculating hashes of queries and downloaded datasets?
  • Otherwise taking into account or acknowledging the WGDC recommendations might be helpful in setting long-term goals for further package development

Background information:

The Research Data Alliance Data Citation WG has listed 14 recommendations on reproducibly subsetting datasets and on how to cite, share and re-use these subsets.

Data retrieved from PxWeb APIs is maybe not as dynamic as other kinds of data, but it still changes occasionally (see the stat.fi news page). There are some nice recommendations that could at least be acknowledged and, if possible, also implemented.

Here is a list of the recommendations:

| Task | Status | Viability |
|---|---|---|
| R1 Data Versioning | | Data versioning is not supported by PxWeb |
| R2 Timestamping | | Timestamping dataset changes, so that querying past datasets would be possible, is not supported by PxWeb |
| R3 Query Store Facilities | | Some PxWeb database websites have a "Save your query" menu, but it does not store all the data that WGDC recommends |
| R4 Query Uniqueness | | pxweb_interactive() constructs queries in a normalised form; it could also calculate an MD5 hash of the query |
| R5 Stable Sorting | | Dataset sorting is determined by the sorting of the raw data on the server |
| R6 Result Set Verification | | Fixity key for downloaded datasets; could be done with digest(dataset, algo = "md5") |
| R7 Query Timestamping | Done | Could also refer to the dataset's date of last update |
| R8 Query PID | | Assign a DOI, ARK, or similar PID to a unique query |
| R9 Store the Query | Done | Refers to the R3 "facilities", but the query is printed by pxweb and can be put into article appendices |
| R10 Automated Citation Texts | Done | |
| R11 Landing Page | | The citation currently links to the .px dataset; proper landing pages with documentation might not be available for all databases. Stat.fi has a "Statistics homepage" for most (all?) datasets / topics |
| R12 Machine Actionability | | Link to a metadata landing page or JSON file |
| R13 Technology Migration | | Responsibility of API / db maintainers |
| R14 Migration Verification | | Compare fixity (hash) information of queries and outputs and see if they are identical |

Recommendations are grouped as follows: R1-3 "Preparing the Data and the Query Store", R4-10 "Persistently Identifying Specific Data Sets", R11-12 "Resolving PIDs and Retrieving the Data" and R13-14 "Upon modifications to the Data Infrastructure".

Especially interesting, in my opinion, would be to integrate the calculation of query and downloaded dataset hashes (R4, R6) and storing them somewhere alongside other citation data.
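As a rough illustration of R4 and R6, the hashes could be computed with the digest package already mentioned in the table. This is only a sketch: `hash_query()` and `hash_dataset()` are hypothetical helper names (not existing pxweb functions), and the query text and data frame are made-up stand-ins for a real PxWeb query and its result.

```r
library(digest)

# Hypothetical helper (not part of pxweb): hash the query text itself.
hash_query <- function(query_json) {
  # serialize = FALSE hashes the characters of the string, so the result
  # matches an MD5 computed on the same text by other tools.
  digest(query_json, algo = "md5", serialize = FALSE)
}

# Hypothetical helper (not part of pxweb): hash a downloaded dataset.
hash_dataset <- function(dataset) {
  # With the default serialize = TRUE, digest hashes R's serialized form of
  # the object, so row order, column order, types and attributes all matter.
  digest(dataset, algo = "md5")
}

# Made-up example query and result, for illustration only.
query_json <- '{"query":[{"code":"Year","selection":{"filter":"item","values":["2021","2022"]}}],"response":{"format":"json"}}'
dataset <- data.frame(Year = c("2021", "2022"), value = c(10, 12))

# Fixity information that could be stored alongside other citation data.
fixity <- list(
  query_md5 = hash_query(query_json),
  data_md5  = hash_dataset(dataset),
  accessed  = format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC")
)
str(fixity)
```

Hashing the serialized R object is convenient but ties the fixity key to R; hashing a canonical text export (for example a CSV written with fixed settings) might be a more portable choice.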

Additionally, R12 could be somewhat achieved by changing the URL in the following citation

  @Misc{,
    title = {Foreign languages selected by upper secondary level students by Year, Area, Gender, Level of education and Information},
    author = {{Statistics Finland}},
    organization = {Statistics Finland},
    address = {Helsinki, Finland},
    year = {2023},
    url = {https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px},
    note = {[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]},
  }

to simply https://stat.fi/en/statistics/ava, which is the closest equivalent to a landing page. I'm not sure whether this URL is accessible from the API, but it is at least listed in a separate CSV file: https://statfin.stat.fi/database/StatFin/StatFin_rap.csv
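The URL swap itself could look roughly like this, assuming the citation is available as a base R `utils::bibentry` object (whether pxweb exposes it in this form has not been checked here). Keeping the API endpoint in the note field is my own suggestion, so that the machine-readable link is not lost.

```r
# Rebuild the citation from above as a bibentry (base R, utils package).
cit <- utils::bibentry(
  bibtype = "Misc",
  title = "Foreign languages selected by upper secondary level students by Year, Area, Gender, Level of education and Information",
  author = utils::person("Statistics Finland"),
  organization = "Statistics Finland",
  address = "Helsinki, Finland",
  year = "2023",
  url = "https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px",
  note = "[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]"
)

# Point the citation at the landing page and keep the API endpoint in the note.
api_url <- cit$url
cit$url <- "https://stat.fi/en/statistics/ava"
cit$note <- paste0(cit$note, " API endpoint: ", api_url)

print(toBibtex(cit))
```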

R4 and R5 are kind of done if you use pxweb_interactive(), as the order in which items are printed is deterministic. If the order of the query printout or of the dataset items changes in any way, the MD5 hashes change as well.
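The sensitivity to ordering is easy to demonstrate with a made-up data frame. The sketch below also shows one possible workaround, sorting by the dimension columns before hashing; `canonical_hash()` is a hypothetical helper, not something pxweb provides, and the column names are invented.

```r
library(digest)

# Made-up result set and a row-shuffled copy of it.
dataset <- data.frame(
  Year   = c("2021", "2022", "2021", "2022"),
  Gender = c("F", "F", "M", "M"),
  value  = c(10, 12, 9, 11),
  stringsAsFactors = FALSE
)
shuffled <- dataset[c(3, 1, 4, 2), ]

# Hypothetical helper: sort rows by the key columns and reset row names,
# so that the same content always serializes the same way.
canonical_hash <- function(df, keys) {
  ordered <- df[do.call(order, unname(as.list(df[keys]))), , drop = FALSE]
  rownames(ordered) <- NULL
  digest(ordered, algo = "md5")
}

# Same content, different row order: the plain hashes differ.
plain_equal <- digest(dataset, algo = "md5") == digest(shuffled, algo = "md5")

# After canonical sorting the hashes agree.
keys <- c("Year", "Gender")
canonical_equal <- canonical_hash(dataset, keys) == canonical_hash(shuffled, keys)

print(c(plain_equal = plain_equal, canonical_equal = canonical_equal))
```

Whether such normalisation belongs in pxweb or should be left to the server-side sorting (R5) is an open question.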

The recommendations are, I think, most useful for PxWeb database maintainers and PxWeb developers at SCB, but we could do our part by thinking about solutions to them.
