Openaire (#18)

Sandra Mierz · web-flow · commit 13a19b7063e8 · 2022-03-21T11:26:02.000+01:00
* query openaire for person-works
* query openaire for work-projects
diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 [![DOI](https://zenodo.org/badge/447263093.svg)](https://zenodo.org/badge/latestdoi/447263093)
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Project-TAPIR/pidgraph-notebooks/main)
 
-A collection of Jupyter notebooks with examples of querying different PID providers like [ORCID](https://orcid.org/), [ROR](https://ror.readme.io/), [Crossref](https://www.crossref.org/) and PID graphs like the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/) and [OpenAlex](https://openalex.org/about) for connected objects. 
+A collection of Jupyter notebooks with examples of querying different PID providers like [ORCID](https://orcid.org/), [ROR](https://ror.readme.io/), [Crossref](https://www.crossref.org/) and PID graphs like the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/), [OpenAlex](https://openalex.org/about) and [OpenAIRE](https://www.openaire.eu/) for connected objects. 
 
 Currently included connections:
 * organization-organization
@@ -17,7 +17,11 @@ Currently included connections:
 * person-works
   * input: ORCID
   * output: list of works authored/created by the person, each identified by their DOI
-  * data sources: Crossref, FREYA PID Graph, OpenAlex, ORCID
+  * data sources: Crossref, FREYA PID Graph, OpenAlex, ORCID, OpenAIRE
+* work-projects
+  * input: DOI
+  * output: list of projects the work was produced in, each identified by their OpenAIRE project ID
+  * data sources: OpenAIRE
   
 
 Please navigate into the respective folder to see the list of available notebooks. 
diff --git a/person-works/README.md b/person-works/README.md
@@ -6,4 +6,5 @@ Currently available PID Graphs:
 * [Crossref](https://www.crossref.org/)
 * [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/)
 * [OpenAlex](https://openalex.org/about)
-* [ORCID](https://orcid.org/)
+* [ORCID](https://orcid.org/)
+* [OpenAIRE](https://www.openaire.eu/)
diff --git a/person-works/openaire_get_publications_by_person.ipynb b/person-works/openaire_get_publications_by_person.ipynb
@@ -0,0 +1,198 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Query OpenAIRE for publications authored by a person\n",
+    "This notebook queries the [OpenAIRE HTTP API](https://graph.openaire.eu/develop/api.html) via its `/publications` endpoint for publications authored by a person. It takes an ORCID iD as input which is used to filter for publications where one of the creators' `orcid` field matches the given ORCID iD. From the resulting list of publications we output all DOIs.\n",
+    "\n",
+    "*Note:\n",
+    "The API has separate endpoints for publications, research data, software metadata and other research products. To get a full picture of a person's research output, you would have to query all of these endpoints and combine their results.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Prerequisites:\n",
+    "import requests                    # dependency for making HTTP calls\n",
+    "from benedict import benedict      # dependency for dealing with json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "The input for this notebook is an ORCID iD, e.g. '`0000-0003-2499-7741`'."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# input parameter\n",
+    "example_orcid_id=\"0000-0003-2499-7741\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use it to query the OpenAIRE HTTP API for publications whose metadata lists the ORCID iD in the `orcid` field of one of their creators. Since the API paginates its results, we loop through all pages to get the complete result set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# OpenAIRE endpoint to query for publications\n",
+    "OPENAIRE_API_PUBLICATIONS = \"https://api.openaire.eu/search/publications\"\n",
+    "\n",
+    "# query OpenAIRE for all publications that are connected to orcid\n",
+    "def query_openaire_for_person2publications(orcid_id):\n",
+    "    page = 1\n",
+    "    max_page = 1\n",
+    "\n",
+    "    while page <= max_page:\n",
+    "        params = {'orcid': orcid_id, 'page': page, 'format': \"json\"}\n",
+    "        response = requests.get(url=OPENAIRE_API_PUBLICATIONS,\n",
+    "                                params=params)\n",
+    "        response.raise_for_status()\n",
+    "        result=response.json()\n",
+    "\n",
+    "        # calculate max page number in first loop\n",
+    "        if max_page == 1:\n",
+    "            max_page = determine_max_page(result)\n",
+    "        page = page + 1\n",
+    "        yield result\n",
+    "\n",
+    "# calculate max number of result pages\n",
+    "def determine_max_page(response_data):\n",
+    "    response_dict = benedict.from_json(response_data)\n",
+    "    items_total = response_dict.get('response.header.total.$')\n",
+    "    items_per_page = response_dict.get('response.header.size.$')\n",
+    "    max_page_ceil = items_total // items_per_page + bool(items_total % items_per_page)\n",
+    "    return max_page_ceil\n",
+    "\n",
+    "\n",
+    "# ---- example execution\n",
+    "list_of_pages=query_openaire_for_person2publications(example_orcid_id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the resulting list of publications we extract and print out each title and DOI. \n",
+    "\n",
+    "*Note: publications without a DOI are not printed.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of publications found: 6\n",
+      "\n",
+      "10.15488/11463, Roadmap to FAIR Research Information in Open Infrastructures\n",
+      "10.1515/bd.2006.40.4.466, Informationsvermittlung: Personalisiertes Lernen in der Bibliothek: das Düsseldorfer Online-Tutorial (DOT) Informationskompetenz\n",
+      "10.1080/00048623.2006.10755322, Teaching Information Literacy with the Lerninformationssystem\n",
+      "10.3389/frma.2021.694307, Enhancing Knowledge Graph Extraction and Validation From Scholarly Publications Using Bibliographic Metadata\n",
+      "10.3897/rio.7.e66264, OPTIMETA – Strengthening the Open Access publishing system through open citations and spatiotemporal metadata\n",
+      "10.1016/j.procs.2019.01.074, The Research Core Dataset (KDSF) in the Linked Data context\n"
+     ]
+    }
+   ],
+   "source": [
+    "# from the result pages, extract the data about each publication\n",
+    "def extract_publications_from_page(page):\n",
+    "    return [pub for pub in benedict.from_json(page).get('response.results.result') or []]\n",
+    "\n",
+    "# extract the first DOI and the main title from a publication\n",
+    "def extract_doi(pub):\n",
+    "    oaf_result=benedict.from_json(pub).get('metadata.oaf:entity.oaf:result')\n",
+    "\n",
+    "    # unfortunately the json data is inconsistently modeled:\n",
+    "    # if there is one pid/title for a publication, it is a json object\n",
+    "    # if there are multiple pids/titles for a publication, they form a json list\n",
+    "    pids=oaf_result.get('pid') or []\n",
+    "    is_doi = lambda pid: pid.get('@classid')==\"doi\"\n",
+    "    if isinstance(pids, list):\n",
+    "        dois=[pid['$'] for pid in pids if is_doi(pid)]\n",
+    "    else:\n",
+    "        dois= [pids['$']] if is_doi(pids) else []\n",
+    "    doi=dois[0] if dois else None # pick the first one\n",
+    "    \n",
+    "    titles=oaf_result.get('title') or []\n",
+    "    is_main_title = lambda title: title.get('@classid')==\"main title\"\n",
+    "    if isinstance(titles, list):\n",
+    "        main_titles=[title['$'] for title in titles if is_main_title(title)]\n",
+    "    else:\n",
+    "        main_titles=[titles['$']] if is_main_title(titles) else []\n",
+    "    title=main_titles[0] if main_titles else None  # pick the first one\n",
+    "\n",
+    "    return doi, title\n",
+    "\n",
+    "\n",
+    "#--- example execution\n",
+    "# collect the publications of all pages first, so the count covers the complete result set\n",
+    "publications=[pub for page in list_of_pages or [] for pub in extract_publications_from_page(page)]\n",
+    "print(f\"Number of publications found: {len(publications)}\\n\")\n",
+    "for pub in publications:\n",
+    "    doi,title = extract_doi(pub)\n",
+    "    if doi:\n",
+    "        print(f\"{doi}, {title}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
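Review note on the notebook above: it relies on two small techniques, rounding the page count up with integer arithmetic and handling `pid`/`title` fields that are either a single JSON object or a list. A minimal sketch of both follows, assuming only what the notebook comments describe. The helper names (`ceil_div`, `as_list`, `first_value`) and the sample records are hypothetical and not part of the notebook or of the OpenAIRE API.

```python
# Hypothetical helpers for illustration only; they are not part of the notebook.

def ceil_div(total, per_page):
    """Number of pages needed for `total` items with `per_page` items per page."""
    # For per_page > 0 this equals total // per_page + bool(total % per_page).
    return -(-total // per_page)


def as_list(value):
    """Wrap a lone JSON object in a list; map None to an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def first_value(entries, classid):
    """Return the '$' value of the first entry whose '@classid' matches, else None."""
    for entry in as_list(entries):
        if entry.get("@classid") == classid:
            return entry.get("$")
    return None


# Made-up records imitating the two shapes described in the notebook comments.
single = {"pid": {"@classid": "doi", "$": "10.1234/example-a"}}
multiple = {"pid": [{"@classid": "pmid", "$": "12345"},
                    {"@classid": "doi", "$": "10.1234/example-b"}]}

print(first_value(single["pid"], "doi"))    # single object
print(first_value(multiple["pid"], "doi"))  # list of objects
print(first_value(None, "doi"))             # field missing
print(ceil_div(6, 10), ceil_div(21, 10))    # page counts
```

Normalizing to a list first removes the separate `isinstance` branches for DOIs and titles, so one lookup function could serve both fields.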
diff --git a/work-projects/README.md b/work-projects/README.md
@@ -0,0 +1,6 @@
+## work-projects
+
+A Jupyter notebook showing an example of using a persistent identifier for a publication (DOI)
+as input for retrieving the projects the work was produced in (each identified by its OpenAIRE project ID).
+
+* [OpenAIRE](https://www.openaire.eu/)
diff --git a/work-projects/openaire_get_projects_by_work.ipynb b/work-projects/openaire_get_projects_by_work.ipynb