This project automates the process of retrieving scientific paper data from Semantic Scholar, enriching it with additional information from CrossRef, and storing it in a MySQL database.
Please remember to respect copyright law in your country. Do not use this software to process abstracts without the necessary permission from rights holders!
- MySQL Server (local or remote, e.g., via XAMPP)
- Python libraries:
  - pymysql
  - requests
  - keyring
  - python-dotenv
  - tabulate
  - colorama
  - pytest
Install the required packages:

```
pip install pymysql requests keyring python-dotenv tabulate colorama pytest
```
- Clone the repository:

```
git clone <repository_url>
cd <project_directory>
```
- Create a `.env` file in the project root with the following structure:

```
API_KEY="your Semantic Scholar API key"
SemSch_URL="https://api.semanticscholar.org/graph/v1/paper/search"
DB_NAME="your_database_name"
DB_password="your_database_password"
HOST="your_database_host"
USER="your_database_user"
```
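Once the file is in place, it helps to fail fast if a variable is missing. Below is a minimal stdlib-only sketch of such a check; the helper name `missing_env_vars` is illustrative and not part of the project:

```python
import os

# The six variables the pipeline reads from .env
REQUIRED_VARS = ("API_KEY", "SemSch_URL", "DB_NAME", "DB_password", "HOST", "USER")

def missing_env_vars(required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```

Calling this right after `load_dotenv()` and aborting on a non-empty result turns a confusing downstream connection error into an immediate, readable one.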
- Set up the MySQL database (locally or remotely using XAMPP or another tool).
- Retrieve Data – Downloads paper data from Semantic Scholar based on custom query parameters.
- Enrich Data – Enriches the data with metadata from CrossRef (if available).
- Prepare Data – Converts JSON data into a format compatible with MySQL.
- Load Data to MySQL – Connects to the database, creates necessary tables, and inserts the enriched data.
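The Prepare Data step essentially flattens nested JSON into scalar columns. A rough sketch of the idea follows; the field names and the `flatten_for_mysql` helper are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical enriched record; the real files/*.json layout may differ.
record = {
    "title": "On Aesthetic Judgement",
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "year": 2023,
    "externalIds": {"DOI": "10.1000/example"},
}

def flatten_for_mysql(rec):
    """Collapse nested lists/objects into scalar values an INSERT can accept."""
    return {
        "title": rec.get("title"),
        "authors": "; ".join(a["name"] for a in rec.get("authors", [])),
        "year": rec.get("year"),
        "doi": (rec.get("externalIds") or {}).get("DOI"),
    }

row = flatten_for_mysql(record)
```

Each resulting value maps directly to one MySQL column, which is what makes the later INSERT step straightforward.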
You can customize the query parameters for Semantic Scholar in the `main.py` file (e.g., change `query`, `fieldsOfStudy`, etc.).
Optional functions (enabled in code):
- View table contents and records in MySQL.
- Update existing database records with new data.
- `source/`
  - `get_data.py` – Retrieves data from Semantic Scholar.
  - `enrich_data.py` – Enriches data with CrossRef metadata.
  - `make_data_compatible.py` – Prepares JSON data for MySQL insertion.
  - `load_data_to_MySQL.py` – Manages MySQL connections, table creation, and data insertion.
- `files/` – Contains intermediate JSON files: raw data, enriched data, compatible data.
- `.env` – Environment file with API keys and database credentials.
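Under the hood, `get_data.py` issues GET requests against the search endpoint configured in `.env`. A stdlib-only sketch of how such a request URL can be assembled (the `build_search_url` helper is illustrative; the project's actual request code may differ):

```python
from urllib.parse import urlencode

def build_search_url(base_url, params):
    """Append query parameters to the Semantic Scholar search endpoint."""
    # Lowercasing booleans is an assumption about the API's expected format.
    encoded = {k: (str(v).lower() if isinstance(v, bool) else v)
               for k, v in params.items()}
    return f"{base_url}?{urlencode(encoded)}"

url = build_search_url(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    {"query": "aesthetics", "year": 2023, "limit": 20},
)
```

Note that the API key itself travels in the `x-api-key` request header rather than in the URL.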
Example `main.py`:

```python
from dotenv import load_dotenv
import os

from source import get_data as gd
from source import enrich_data as ed
from source import make_data_compatible as mdc
from source import load_data_to_MySQL as ldm

# Load credentials from the .env file created during setup
load_dotenv(dotenv_path=".env")
SemSch_Api = os.getenv('API_KEY')
SemSch_URL = os.getenv('SemSch_URL')
DB_name = os.getenv('DB_NAME')
DB_password = os.getenv('DB_password')
DB_host = os.getenv('HOST')
DB_user = os.getenv('USER')

# Query parameters for the Semantic Scholar search
query_params = {
    'query': "aesthetics",
    'isOpenAccess': True,
    'year': 2023,
    'limit': 20,
    'fieldsOfStudy': 'Philosophy',
    'fields': 'title,authors,abstract,externalIds,year,venue,url'
}

# 1. Retrieve raw paper data from Semantic Scholar
gd.get_data(SemSch_Api, SemSch_URL, query_params)

# 2. Enrich it with CrossRef metadata
ed.enrich_data(print_result=False)

# 3. Convert the JSON into a MySQL-compatible format
mdc.process_data(print_process=False, testing=False)

# 4. Connect to MySQL and load the data
connection = ldm.connection(
    db=DB_name,
    password=DB_password,
    localhost=False,
    host=DB_host,
    user=DB_user
)
ldm.create_database(
    con=connection,
    compatible_json_path='files/compatible_enriched_data.json',
    table_name="aesthetics_test_final",
    print_process=False
)

# Optional: preview the first few inserted records
ldm.print_table_contents(connection, 'aesthetics_test_final', number=5)

connection.close()
```
This project is open-source. Feel free to contribute or adapt it for your use case!