
Retrieves scientific article abstracts, applies tagging to categorize them, and uploads the results to your MySQL server.


cognitive-metascience/psychological_abstract_crawler


Semantic Scholar & CrossRef Data Enrichment and MySQL Loader

This project retrieves scientific paper data from Semantic Scholar, enriches it with additional metadata from CrossRef, and stores the result in a MySQL database.

Please remember to respect the copyright law of your country. Do not use this software to process abstracts without the necessary permission of the rights holders.

Prerequisites

  • MySQL Server (local or remote, e.g., via XAMPP)
  • Python Libraries:
    • pymysql
    • pytest
    • requests
    • keyring
    • python-dotenv (imported as dotenv)
    • tabulate
    • colorama

Install the required packages:

pip install pymysql requests keyring python-dotenv tabulate colorama pytest

Installation and Configuration

  1. Clone the repository:
git clone <repository_url>
cd <project_directory>
  2. Create a .env file in the project root with the following structure:
API_KEY="your Semantic Scholar API key"
SemSch_URL="https://api.semanticscholar.org/graph/v1/paper/search"
DB_NAME="your_database_name"
DB_password="your_database_password"
HOST="your_database_host"
USER="your_database_user"
  3. Set up the MySQL database (locally or remotely, using XAMPP or another tool).
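
Before running the pipeline, it can help to confirm that every setting from the .env file is actually visible to Python. The following is a minimal sketch, not part of the project: the helper name missing_settings is hypothetical, and the key list is copied from the .env structure above.

```python
import os

# Keys copied from the .env structure described in this README.
REQUIRED_KEYS = ["API_KEY", "SemSch_URL", "DB_NAME", "DB_password", "HOST", "USER"]


def missing_settings(env):
    """Return the required keys that are absent or empty in `env` (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    # Run this after load_dotenv() so the values from .env are in os.environ.
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All settings found.")
```

Note that on many systems USER is already defined by the shell. python-dotenv's load_dotenv keeps existing environment variables unless it is called with override=True, so the USER value from .env may be ignored in that case.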

Steps performed:

  1. Retrieve Data – Downloads paper data from Semantic Scholar based on custom query parameters.
  2. Enrich Data – Enriches the data with metadata from CrossRef (if available).
  3. Prepare Data – Converts JSON data into a format compatible with MySQL.
  4. Load Data to MySQL – Connects to the database, creates necessary tables, and inserts the enriched data.
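
As an illustration of the "Prepare Data" step, the sketch below flattens one nested paper record into a flat row that maps directly onto table columns. It is an assumption about what such a conversion could look like, not the actual logic of make_data_compatible.py; the function name flatten_paper and the output column names are hypothetical, while the input field names follow the 'fields' requested from Semantic Scholar in the quick start.

```python
def flatten_paper(paper):
    """Turn one nested paper record into a flat dict of column values.

    Hypothetical helper: the real project may use different column names.
    """
    # Authors arrive as a list of objects; join their names into one string.
    authors = ", ".join(a.get("name", "") for a in paper.get("authors") or [])
    # externalIds is a nested object; only the DOI is kept here.
    external_ids = paper.get("externalIds") or {}
    return {
        "title": paper.get("title"),
        "authors": authors,
        "abstract": paper.get("abstract"),
        "doi": external_ids.get("DOI"),
        "year": paper.get("year"),
        "venue": paper.get("venue"),
        "url": paper.get("url"),
    }


# Made-up record, for illustration only.
example = {
    "title": "An example title",
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "externalIds": {"DOI": "10.1000/xyz123"},
    "year": 2023,
}
row = flatten_paper(example)
```

Missing fields become None, which a MySQL driver such as pymysql sends as NULL.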

You can customize the query parameters for Semantic Scholar in the main.py file (e.g., change query, fieldsOfStudy, etc.).
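
To see how such query parameters end up in a request, here is a small standard-library sketch that builds (but does not send) a search request. It is not the code in get_data.py; build_search_request is a hypothetical name, and the key value is a placeholder. The x-api-key header is the one the Semantic Scholar API uses for authentication.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url, api_key, params):
    """Build a GET request for the paper search endpoint (hypothetical helper)."""
    # urlencode percent-encodes values, e.g. the commas in 'fields'.
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={"x-api-key": api_key})


params = {
    "query": "aesthetics",
    "year": 2023,
    "limit": 20,
    "fieldsOfStudy": "Philosophy",
    "fields": "title,authors,abstract,externalIds,year,venue,url",
}
request = build_search_request(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    "demo-key",  # placeholder, not a real key
    params,
)
```

Printing request.full_url shows the final URL, which is a quick way to check a changed query or fieldsOfStudy value before spending API quota.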

Optional functions (enabled in code):

  • View table contents and records in MySQL.
  • Update existing database records with new data.
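
One common way to update existing records in MySQL is an INSERT ... ON DUPLICATE KEY UPDATE statement. The sketch below only builds such a statement with %s placeholders, as used by pymysql. It is an assumption, not the project's update function: build_upsert, the table name, and the column names are hypothetical, and the approach only works if the table has a primary or unique key (for example on the DOI).

```python
def build_upsert(table, columns):
    """Build a parameterized MySQL upsert statement (hypothetical helper).

    `table` and `columns` are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    cols = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in columns)
    return (
        f"INSERT INTO `{table}` ({cols}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


sql = build_upsert("papers", ["doi", "title"])
```

With an open pymysql connection, the statement could then be run as cursor.execute(sql, ("10.1000/xyz123", "An example title")) followed by connection.commit().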

Folder Structure and Key Modules

  • source/

    • get_data.py – Retrieves data from Semantic Scholar.
    • enrich_data.py – Enriches data with CrossRef metadata.
    • make_data_compatible.py – Prepares JSON data for MySQL insertion.
    • load_data_to_MySQL.py – Manages MySQL connections, table creation, and data insertion.
  • files/

    • Contains intermediate JSON files: raw data, enriched data, compatible data.
  • .env – Environment file with API keys and database credentials.
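
For orientation, the enrichment step depends on looking up each paper in CrossRef, which is typically done by DOI. The sketch below builds the lookup URL for one record using the public CrossRef REST endpoint https://api.crossref.org/works/. It is an assumption about how the lookup could be addressed, not the code in enrich_data.py, and crossref_url is a hypothetical name.

```python
from urllib.parse import quote

CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def crossref_url(paper):
    """Return the CrossRef lookup URL for a paper, or None if it has no DOI."""
    doi = (paper.get("externalIds") or {}).get("DOI")
    if not doi:
        # Without a DOI there is nothing to look up; the record stays unenriched.
        return None
    # DOIs contain '/', so percent-encode the whole identifier.
    return CROSSREF_WORKS_URL + quote(doi, safe="")


# Made-up DOI, for illustration only.
url = crossref_url({"externalIds": {"DOI": "10.1000/xyz123"}})
```

Records without a DOI return None, which matches the "if available" wording of the enrichment step.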

Quick start!

from dotenv import load_dotenv
import os
from source import get_data as gd
from source import enrich_data as ed
from source import make_data_compatible as mdc
from source import load_data_to_MySQL as ldm

Load environment variables from .env

load_dotenv(dotenv_path=".env")

Extract credentials and API details from environment

SemSch_Api = os.getenv('API_KEY')
SemSch_URL = os.getenv('SemSch_URL')
DB_name = os.getenv('DB_NAME')
DB_password = os.getenv('DB_password')
DB_host = os.getenv('HOST')
DB_user = os.getenv('USER')

Step 1: Get data from Semantic Scholar

query_params = {
    'query': "aesthetics",
    'isOpenAccess': True,
    'year': 2023,
    'limit': 20,
    'fieldsOfStudy': 'Philosophy',
    'fields': 'title,authors,abstract,externalIds,year,venue,url'
}
gd.get_data(SemSch_Api, SemSch_URL, query_params)

Step 2: Enrich data with CrossRef

ed.enrich_data(print_result=False)

Step 3: Convert data to MySQL-compatible format

mdc.process_data(print_process=False, testing=False)

Step 4: Connect to MySQL database

connection = ldm.connection(
    db=DB_name,
    password=DB_password,
    localhost=False,
    host=DB_host,
    user=DB_user
)

Step 5: Create table and insert data

ldm.create_database(
    con=connection,
    compatible_json_path='files/compatible_enriched_data.json',
    table_name="aesthetics_test_final",
    print_process=False
)

Optional: Print contents of the table

ldm.print_table_contents(connection, 'aesthetics_test_final', number=5)

Close connection

connection.close()

This project is open-source. Feel free to contribute or adapt it for your use case!
