This project automates the process of retrieving scientific paper data from Semantic Scholar, enriching it with additional information from CrossRef, and storing it in a MySQL database.
Please remember to respect copyright law in your country. Do not use this software to process abstracts without the necessary permission from rights holders!
- MySQL Server (local or remote, e.g., via XAMPP)
- Python libraries:
  - pymysql
  - requests
  - keyring
  - python-dotenv
  - tabulate
  - colorama
  - pytest
Install the required packages:

```
pip install pymysql requests keyring python-dotenv tabulate colorama pytest
```
- Clone the repository:

```
git clone <repository_url>
cd <project_directory>
```
- Create a `.env` file in the project root with the following structure:

```
API_KEY="your Semantic Scholar API key"
SemSch_URL="https://api.semanticscholar.org/graph/v1/paper/search"
DB_NAME="your_database_name"
DB_password="your_database_password"
HOST="your_database_host"
USER="your_database_user"
```
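Once the file is in place, it helps to fail fast if a variable is missing. Below is a minimal stdlib-only sketch of such a check; the helper name `missing_env_vars` is illustrative and not part of the project:

```python
import os

# The six variables the pipeline reads from .env
REQUIRED_VARS = ("API_KEY", "SemSch_URL", "DB_NAME", "DB_password", "HOST", "USER")

def missing_env_vars(required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```

Calling this right after `load_dotenv()` and aborting on a non-empty result turns a confusing downstream connection error into an immediate, readable one.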
- Set up the MySQL database (locally or remotely using XAMPP or another tool).
- Retrieve Data – Downloads paper data from Semantic Scholar based on custom query parameters.
- Enrich Data – Enriches the data with metadata from CrossRef (if available).
- Prepare Data – Converts JSON data into a format compatible with MySQL.
- Load Data to MySQL – Connects to the database, creates necessary tables, and inserts the enriched data.
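The Prepare Data step essentially flattens nested JSON into scalar columns. A rough sketch of the idea follows; the field names and the `flatten_for_mysql` helper are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical enriched record; the real files/*.json layout may differ.
record = {
    "title": "On Aesthetic Judgement",
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "year": 2023,
    "externalIds": {"DOI": "10.1000/example"},
}

def flatten_for_mysql(rec):
    """Collapse nested lists/objects into scalar values an INSERT can accept."""
    return {
        "title": rec.get("title"),
        "authors": "; ".join(a["name"] for a in rec.get("authors", [])),
        "year": rec.get("year"),
        "doi": (rec.get("externalIds") or {}).get("DOI"),
    }

row = flatten_for_mysql(record)
```

Each resulting value maps directly to one MySQL column, which is what makes the later INSERT step straightforward.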
You can customize the query parameters for Semantic Scholar in the `main.py` file (e.g., change `query`, `fieldsOfStudy`, etc.).
Optional functions (enabled in code):
- View table contents and records in MySQL.
- Update existing database records with new data.
- `source/`
  - `get_data.py` – Retrieves data from Semantic Scholar.
  - `enrich_data.py` – Enriches data with CrossRef metadata.
  - `make_data_compatible.py` – Prepares JSON data for MySQL insertion.
  - `load_data_to_MySQL.py` – Manages MySQL connections, table creation, and data insertion.
- `files/` – Contains intermediate JSON files: raw data, enriched data, compatible data.
- `.env` – Environment file with API keys and database credentials.
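Under the hood, `get_data.py` issues GET requests against the search endpoint configured in `.env`. A stdlib-only sketch of how such a request URL can be assembled (the `build_search_url` helper is illustrative; the project's actual request code may differ):

```python
from urllib.parse import urlencode

def build_search_url(base_url, params):
    """Append query parameters to the Semantic Scholar search endpoint."""
    # Lowercasing booleans is an assumption about the API's expected format.
    encoded = {k: (str(v).lower() if isinstance(v, bool) else v)
               for k, v in params.items()}
    return f"{base_url}?{urlencode(encoded)}"

url = build_search_url(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    {"query": "aesthetics", "year": 2023, "limit": 20},
)
```

Note that the API key itself travels in the `x-api-key` request header rather than in the URL.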
Example `main.py`:

```python
from dotenv import load_dotenv
import os

from source import get_data as gd
from source import enrich_data as ed
from source import make_data_compatible as mdc
from source import load_data_to_MySQL as ldm

# Load credentials from the .env file created during setup
load_dotenv(dotenv_path=".env")
SemSch_Api = os.getenv('API_KEY')
SemSch_URL = os.getenv('SemSch_URL')
DB_name = os.getenv('DB_NAME')
DB_password = os.getenv('DB_password')
DB_host = os.getenv('HOST')
DB_user = os.getenv('USER')

# Query parameters for the Semantic Scholar search
query_params = {
    'query': "aesthetics",
    'isOpenAccess': True,
    'year': 2023,
    'limit': 20,
    'fieldsOfStudy': 'Philosophy',
    'fields': 'title,authors,abstract,externalIds,year,venue,url'
}

# 1. Retrieve raw paper data from Semantic Scholar
gd.get_data(SemSch_Api, SemSch_URL, query_params)

# 2. Enrich it with CrossRef metadata
ed.enrich_data(print_result=False)

# 3. Convert the JSON into a MySQL-compatible format
mdc.process_data(print_process=False, testing=False)

# 4. Connect to MySQL and load the data
connection = ldm.connection(
    db=DB_name,
    password=DB_password,
    localhost=False,
    host=DB_host,
    user=DB_user
)
ldm.create_database(
    con=connection,
    compatible_json_path='files/compatible_enriched_data.json',
    table_name="aesthetics_test_final",
    print_process=False
)

# Optional: preview the first few inserted records
ldm.print_table_contents(connection, 'aesthetics_test_final', number=5)

connection.close()
```
This project is open-source. Feel free to contribute or adapt it for your use case!