Address RDA Working Group on Dynamic Data Citation (WGDC) recommendations #266

**Executive summary:**

- Could the pxweb package include functions for calculating hashes of queries and downloaded datasets?
- Otherwise, acknowledging the WGDC recommendations might help in setting long-term goals for further package development.

**Background information:**

The Research Data Alliance [Data Citation WG](https://www.rd-alliance.org/groups/data-citation-wg.html) has listed 14 recommendations on reproducibly subsetting datasets and on how to cite, share and re-use these subsets:

- 2 page summary: https://zenodo.org/record/1406002
- more extensive report: https://zenodo.org/record/4048304

Data retrieved from PxWeb APIs is perhaps not as dynamic as other kinds of data, but it still changes occasionally (see the [stat.fi news page](https://www.stat.fi/tup/statfin/uutiset_en.html)). Some of the recommendations could at least be acknowledged and, if possible, also implemented.

Here is a list of the recommendations:

| Task                         | Status | Viability                                                                                                           |
|------------------------------|--------|---------------------------------------------------------------------------------------------------------------------|
| R1 Data Versioning           |        | Data versioning is not supported by PxWeb |
| R2 Timestamping              |        | Timestamping dataset changes, so that past states of a dataset could be queried, is not supported by PxWeb |
| R3 Query Store Facilities    |        | Some PxWeb database websites have a "Save your query" menu, but it does not store all the data that WGDC recommends |
| R4 Query Uniqueness          |        | `pxweb_interactive()` constructs queries in a normalised form; an MD5 hash of the query could also be calculated |
| R5 Stable Sorting            |        | Dataset sorting is determined by the sorting of the raw data on the server |
| R6 Result Set Verification   |        | Fixity key for downloaded datasets; could be done with `digest(dataset, algo = "md5")` |
| R7 Query Timestamping        | Done   | Could also refer to the dataset date of last update                                                                 |
| R8 Query PID                 |        | Assign a DOI, ARK, or similar PID to a unique query                   |
| R9 Store the Query           | Done   | Refers to the R3 "facilities", but the query is printed by pxweb and can be put into article appendices |
| R10 Automated Citation Texts | Done   |                                                                                                                     |
| R11 Landing Page             |        | The citation currently links to the .px dataset; proper landing pages with documentation might not be available for all databases. Stat.fi has a "Statistics homepage" for most (all?) datasets / topics |
| R12 Machine Actionability    |        | Link to metadata landing page or JSON file                                                                          |
| R13 Technology Migration     |        | Responsibility of API / db maintainers                                                                              |
| R14 Migration Verification   |        | Compare fixity (hash) information of queries and outputs and see if they are identical          |


Recommendations are grouped as follows: R1-3 "Preparing the Data and the Query Store", R4-10 "Persistently Identifying Specific Data Sets", R11-12 "Resolving PIDs and Retrieving the Data" and R13-14 "Upon modifications to the Data Infrastructure". 

Especially interesting, in my opinion, would be to integrate the calculation of hashes for queries and downloaded datasets (R4, R6) and to store them somewhere alongside the other citation data.
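
As a rough illustration of R4 and R6, here is a minimal sketch using the `digest` package mentioned in the table. The query contents, the toy dataset and the helpers `canonical_query()` and `query_hash()` are hypothetical and not part of pxweb. The sketch also assumes that sorting dimension names and values does not change what the query means.

```r
library(digest)

# Hypothetical query as a plain named list; the variable codes and values
# are placeholders, not taken from a real table.
query <- list(
  Vuosi = c("2022", "2021"),
  Alue = c("SSS"),
  Tiedot = c("*")
)

# Hypothetical helper: write the query as one canonical string.
# Dimensions are sorted by name and values within each dimension, using
# radix sorting so that the result does not depend on the session locale.
canonical_query <- function(q) {
  q <- lapply(q, function(v) sort(unique(as.character(v)), method = "radix"))
  q <- q[order(names(q), method = "radix")]
  values <- vapply(q, paste, character(1), collapse = ",")
  paste0(names(q), "=", values, collapse = ";")
}

# Hash the canonical string itself (serialize = FALSE) rather than the
# serialized R object, so the hash depends only on the string contents.
query_hash <- function(q) {
  digest(canonical_query(q), algo = "md5", serialize = FALSE)
}

# Toy data frame standing in for a downloaded dataset.
dataset <- data.frame(Vuosi = c("2021", "2022"), value = c(10, 20))

# Fixity information that could be stored next to the citation data.
fixity <- list(
  query_md5 = query_hash(query),
  data_md5 = digest(dataset, algo = "md5")
)
print(fixity)
```

Note that `digest(dataset, algo = "md5")` hashes the serialized R object, so that value is probably only comparable between R sessions, whereas the query hash is computed from a plain string.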

Additionally, R12 could be somewhat achieved by changing the URL in the following citation

```
  @Misc{,
    title = {Foreign languages selected by upper secondary level students by Year, Area, Gender, Level of education and Information},
    author = {{Statistics Finland}},
    organization = {Statistics Finland},
    address = {Helsinki, Finland},
    year = {2023},
    url = {https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px},
    note = {[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]},
  }
```

to simply https://stat.fi/en/statistics/ava, which is the closest equivalent to a landing page. I'm not sure whether this URL is accessible from the API, but it is at least listed in a separate CSV file: https://statfin.stat.fi/database/StatFin/StatFin_rap.csv
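
To make the URL change concrete, here is a sketch that rebuilds the citation above with base R's `utils::bibentry()` and takes the URL as a parameter. This is not how pxweb generates its citation text; `cite_table()` is a hypothetical helper, and the field values are copied from the example above.

```r
# Hypothetical helper: build the citation entry with the URL as a parameter,
# so the API URL can be swapped for a landing page URL.
cite_table <- function(url) {
  utils::bibentry(
    bibtype = "Misc",
    title = paste(
      "Foreign languages selected by upper secondary level students",
      "by Year, Area, Gender, Level of education and Information"
    ),
    author = utils::person("Statistics Finland"),
    organization = "Statistics Finland",
    address = "Helsinki, Finland",
    year = "2023",
    url = url,
    note = "[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]"
  )
}

# Same citation, pointing at the statistics homepage instead of the .px table.
landing_citation <- cite_table("https://stat.fi/en/statistics/ava")
bibtex_text <- paste(utils::toBibtex(landing_citation), collapse = "\n")
cat(bibtex_text, "\n")
```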

R4 and R5 are more or less done if you use `pxweb_interactive()`, as the order in which items are printed is deterministic. If the order of the query printout or of the dataset items changes in any way, the MD5 hashes change as well.
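
The sensitivity to ordering can be sketched with a toy data frame. One possible way to reduce it, assumed here and not something pxweb does, is to sort rows on all columns before hashing; `stable_hash()` is a hypothetical helper.

```r
library(digest)

# Toy data standing in for a downloaded table.
d1 <- data.frame(Year = c("2021", "2022"), value = c(10, 20))
d2 <- d1[c(2, 1), ]  # same rows in a different order

# The plain hashes differ, because row order is part of the hashed object.
plain_equal <- digest(d1, algo = "md5") == digest(d2, algo = "md5")

# Hypothetical helper: sort rows on all columns (locale-independent radix
# ordering) and reset the row names before hashing.
stable_hash <- function(df) {
  ord <- do.call(order, c(unname(as.list(df)), list(method = "radix")))
  sorted <- df[ord, , drop = FALSE]
  rownames(sorted) <- NULL
  digest(sorted, algo = "md5")
}

stable_equal <- stable_hash(d1) == stable_hash(d2)
print(c(plain = plain_equal, stable = stable_equal))
```

Sorting before hashing only addresses row order. Changes in column order, column types or number formatting would still give a different hash.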

The recommendations are, I think, most useful for PxWeb database maintainers and PxWeb developers at SCB, but we could do our part by thinking about solutions to them.
