Executive summary:
- Could the pxweb package include functions for calculating hashes of queries and downloaded datasets?
- Otherwise, taking the WGDC recommendations into account, or at least acknowledging them, might help in setting long-term goals for further package development.
Background information:
The Research Data Alliance Data Citation WG has listed 14 recommendations on reproducibly subsetting datasets and on how to cite, share and re-use these subsets:
- 2 page summary: https://zenodo.org/record/1406002
- more extensive report: https://zenodo.org/record/4048304
Data retrieved from PxWeb APIs may not be as dynamic as other kinds of data, but it still changes occasionally (see the stat.fi news page). There are some nice recommendations that could at least be acknowledged and, if possible, also implemented.
Here is a list of the recommendations:
Task | Status | Viability |
---|---|---|
R1 Data Versioning | Data versioning is not supported by PxWeb | |
R2 Timestamping | Timestamping dataset changes, so that past versions of a dataset could be queried, is not supported by PxWeb | |
R3 Query Store Facilities | Some PxWeb database websites have a "Save your query" menu, but it does not include all the data that WGDC recommends | |
R4 Query Uniqueness | pxweb_interactive() constructs queries in a normalised form; an MD5 hash of the query could also be calculated | |
R5 Stable Sorting | Dataset sorting is determined by the sorting of the raw data on the server | |
R6 Result Set Verification | A fixity key for downloaded datasets could be calculated with digest(dataset, algo = "md5") | |
R7 Query Timestamping | Done | Could also refer to the dataset date of last update |
R8 Query PID | Assign a DOI, ARK, or similar PID to a unique query | |
R9 Store the Query | Done | Refers to R3 "facilities" but query is printed by pxweb and that can be put into article appendices |
R10 Automated Citation Texts | Done | |
R11 Landing Page | The citation currently links to the .px dataset; proper landing pages with documentation might not be available for all databases. Stat.fi has a "Statistics homepage" for most (all?) datasets / topics | |
R12 Machine Actionability | Link to metadata landing page or JSON file | |
R13 Technology Migration | Responsibility of API / db maintainers | |
R14 Migration Verification | Compare fixity (hash) information of queries and outputs and see if they are identical | |
Recommendations are grouped as follows: R1-3 "Preparing the Data and the Query Store", R4-10 "Persistently Identifying Specific Data Sets", R11-12 "Resolving PIDs and Retrieving the Data" and R13-14 "Upon modifications to the Data Infrastructure".
Especially interesting, in my opinion, would be to integrate the calculation of hashes for queries and downloaded datasets (R4, R6) and to store them somewhere alongside the other citation data.
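As a rough sketch of what R4 and R6 could look like in practice, the snippet below hashes a query and a dataset with the digest package and keeps both hashes next to an access timestamp. The query, its dimension codes (`Vuosi`, `Alue`, `Tiedot`) and the small data frame are hypothetical stand-ins, not real pxweb output, and `fixity` is just an illustrative name. Note that `digest()` hashes the serialized R object here, so the result is tied to the R representation and not to the JSON sent to the API.

```r
library(digest)

# Hypothetical query in list form; the dimension codes are illustrative only.
query <- list(
  Vuosi  = c("2021", "2022"),
  Alue   = c("SSS"),
  Tiedot = c("*")
)

# Hypothetical stand-in for a downloaded dataset.
dataset <- data.frame(
  Vuosi = c("2021", "2022"),
  Alue  = c("SSS", "SSS"),
  value = c(10, 12),
  stringsAsFactors = FALSE
)

# R4: hash of the query; R6: fixity key of the result set.
query_hash   <- digest(query, algo = "md5")
dataset_hash <- digest(dataset, algo = "md5")

# Keep the hashes alongside the access time (assumed storage layout).
fixity <- list(
  query_md5   = query_hash,
  dataset_md5 = dataset_hash,
  accessed    = format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC")
)
str(fixity)
```

Recomputing `digest(dataset, algo = "md5")` later and comparing it with the stored `dataset_md5` would be one way to notice that a re-downloaded dataset has changed.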
Additionally, R12 could be somewhat achieved by changing the URL in the following citation
@Misc{,
title = {Foreign languages selected by upper secondary level students by Year, Area, Gender, Level of education and Information},
author = {{Statistics Finland}},
organization = {Statistics Finland},
address = {Helsinki, Finland},
year = {2023},
url = {https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px},
note = {[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]},
}
to simply https://stat.fi/en/statistics/ava, which is the closest equivalent to a landing page. I'm not sure whether this URL is accessible from the API, but it is at least listed in a separate CSV file: https://statfin.stat.fi/database/StatFin/StatFin_rap.csv
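For illustration only, the same entry with the landing page URL can be built with base R's `utils::bibentry()`. This is a sketch of the resulting citation, not how pxweb generates its citations internally; the field values are copied from the example above, and `citation_entry` is a made-up name.

```r
# Build the citation with the landing page as the URL (R11/R12).
citation_entry <- utils::bibentry(
  bibtype      = "Misc",
  title        = paste(
    "Foreign languages selected by upper secondary level students",
    "by Year, Area, Gender, Level of education and Information"
  ),
  author       = utils::person("Statistics Finland"),
  organization = "Statistics Finland",
  address      = "Helsinki, Finland",
  year         = "2023",
  url          = "https://stat.fi/en/statistics/ava",
  note         = "[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]"
)

# Print the entry in BibTeX form.
print(utils::toBibtex(citation_entry))
```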
R4 and R5 are more or less done if you use pxweb_interactive(), as the order in which items are printed is deterministic. If the order of the query printout or of the dataset items changes in any way, the MD5 hashes change as well.
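The order sensitivity can be shown with a small sketch. `normalise_query()` is a hypothetical helper, not part of pxweb: it sorts the dimensions by name and the values within each dimension, so that two hand-written queries asking for the same cells hash identically. The dimension codes are again illustrative.

```r
library(digest)

# Hypothetical helper: put a query list into a canonical order.
normalise_query <- function(query) {
  query <- query[order(names(query))]  # sort dimensions by name
  lapply(query, sort)                  # sort values within each dimension
}

# Two queries selecting the same cells, written in a different order.
q1 <- list(Vuosi = c("2021", "2022"), Alue = "SSS")
q2 <- list(Alue = "SSS", Vuosi = c("2022", "2021"))

# Without normalisation the hashes differ ...
raw_equal <- digest(q1, algo = "md5") == digest(q2, algo = "md5")

# ... after normalisation they match.
norm_equal <- digest(normalise_query(q1), algo = "md5") ==
  digest(normalise_query(q2), algo = "md5")

print(c(raw_equal = raw_equal, norm_equal = norm_equal))
```

Normalising the query this way only addresses R4. The row order of the returned dataset would still have to be fixed separately for the dataset hash to be stable (R5).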
The recommendations are, I think, most useful for PxWeb database maintainers and PxWeb developers at SCB, but we could do our part by thinking about solutions to them.